6.0 Overview of the Unit

In this Unit, some of the issues involved in standard setting
are reviewed, along with methods for setting standards. The review
draws on the work of Millman, Meskauskas, and Glass and incorporates
many of the newer standard-setting methods. The standard-setting
methods are organized into three categories: judgmental methods,
empirical methods, and combinations of judgmental and empirical
methods. Procedures for setting standards to accomplish three
primary uses of criterion-referenced testing are discussed in a final
section of the paper.

6.1 Introduction

In a recent review of the criterion-referenced testing field,
Hambleton, Swaminathan, Algina, and Coulson (1978) delineated two
major uses for test scores derived from criterion-referenced tests:
domain score estimation and the allocation of examinees to mastery
states. The second use, the allocation of examinees to mastery
states, requires the setting of a performance standard, or cut-off
score.

Based upon an individual's score on a test, where the test is
a representative sample of the subject domain, a mastery/non-mastery
decision concerning the domain from which the item sample was drawn
is sought. Millman (1973) summarizes the situation well:

Of interest is the proportion of such items a 
student can pass. It is assumed that some edu- 
cational decision, e.g., the nature of subsequent 
instruction for the student, is conditional upon 
whether or not he exceeds a proficiency standard 
when administered a sample of items from the 
domain. Thus, attention is directed toward the 
individual examinee and his performance relative 
to the standard rather than toward producing 
indicators of group performance. 

Thus, it can be seen that in this criterion-referenced testing
situation, a cut-off score must be set in order to make a decision
about an individual's mastery status (there can be multiple cut-off
scores on the domain score scale, although usually only one is set).
The results of this decision will depend upon the context
within which the test is being used. As an example, consider the
Mastery Learning paradigm (Block, 1972). In this situation, if a
student's score exceeds the cutting score, he/she is advanced to
the next unit of instruction. If the student's score falls below
the standard, remedial activities are prescribed. It is important
to understand that the decision being made is on the level of the
individual, and as such, the status of other individuals does not
enter into the decision. As a second example, consider the use of
criterion-referenced tests to provide test data relative to a set
of basic skills of which students must demonstrate mastery (i.e.,
achieve specified levels of performance) in order to graduate from
high school. In this context, decisions are very important because
whether or not students can graduate will depend on their
criterion-referenced test score performance and the resulting
mastery/non-mastery decisions which are made.

These situations can be contrasted with the setting of standards 
for norm-referenced tests, which is considerably less complex. Since 
for tests constructed to yield norm-referenced interpretations, an 
individual is compared to others, it makes sense to set a passing or 
cut-off score so that a certain percent of the students pass. If, 
for instance, only 20% of the students taking an exam can be placed 
in an enrichment program, then a passing score that passes 20% of the 
students would make sense. 
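
The quota logic just described can be sketched in a few lines. This is only an illustrative sketch: the function name `quota_cutoff` and the score list are hypothetical, and ties are resolved conservatively so that no more than the allowed share of examinees passes.

```python
def quota_cutoff(scores, pass_fraction):
    """Return the lowest score c such that the share of examinees
    scoring at or above c does not exceed pass_fraction.
    Returns None if even the top score group is too large."""
    # Small epsilon guards against floating-point error in the product.
    max_passers = int(len(scores) * pass_fraction + 1e-9)
    cutoff = None
    for candidate in sorted(set(scores), reverse=True):
        passers = sum(1 for s in scores if s >= candidate)
        if passers > max_passers:
            break
        cutoff = candidate
    return cutoff


# Hypothetical scores for ten examinees; only 20% may be placed.
scores = [55, 62, 70, 71, 75, 80, 84, 88, 91, 97]
print(quota_cutoff(scores, 0.20))
```

Note that the resulting cut-off depends entirely on who else took the test, which is exactly what distinguishes it from a criterion-referenced standard.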

Given what has just been said about the importance of cut-off
scores for proper criterion-referenced test score usage, one would
think that this would be a well-researched and documented area. This
is simply not the case. Most of the work done to date has been
concerned with the suggestion of possible methods, perhaps twenty-five
in number, rather than with actual empirical investigations. In
addition to the individual work done, there have been two excellent
reviews of cut-score procedures (Millman, 1973; Meskauskas,
1976), and one recent review that was highly critical of the field
(Glass, 1978a, 1978b).

6.2 Some Issues in Standard Setting

One of the primary purposes of criterion-referenced testing is
to provide data for decision-making. Sometimes the decisions are made
by classroom teachers concerning the monitoring of student progress
through a curriculum. On other occasions, promotion, certification, and/or
graduation decisions are made by school, district, and state
administrators.

Glass (1978a) was rather critical of
measurement specialists for giving too little attention to the
problem of determining cut-off scores [he notes, "A common expression of
wishful thinking is to base a grand scheme on a fundamental, unsolved
problem." (p. 1)]. On the other hand, a considerable amount of
criterion-referenced testing research has been done. Not all uses of
criterion-referenced tests require cut-off scores (for example,
description), and moreover, the problem does not really arise until a
criterion-referenced test has been constructed. Also, it should not
be forgotten that problems associated with cut-off scores are
difficult and so solutions are going to require more time.

6.2.1 Uses of Cut-off Scores in Decision Making

A "cut-off score" is a point on a test score scale that is
used to "sort" examinees into two categories which reflect different
levels of proficiency relative to a particular objective measured by
a test. It is common to assign labels such as "masters" and
"non-masters" to examinees assigned to the two categories. It is not
unusual either to assign examinees to more than two categories based
on their test performance (i.e., sometimes multiple cut-off scores are
used) or to use cut-off scores that vary from one objective to
another (this may be done when it is felt that a set of objectives
differ in their importance).
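
As a minimal sketch of the sorting just described, the function below assigns an examinee to a category given one or more cut-off scores. The function name `classify`, the label "partial master," and the numeric values are hypothetical; the sketch also assumes that a score exactly equal to a cut-off falls in the higher category.

```python
from bisect import bisect_right


def classify(score, cutoffs, labels):
    """Sort a score into one of len(cutoffs) + 1 ordered categories.

    cutoffs must be in ascending order; a score equal to a cut-off
    is placed in the higher category."""
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more label than cut-offs")
    return labels[bisect_right(cutoffs, score)]


# One cut-off score: the usual master / non-master decision.
print(classify(85, [80], ["non-master", "master"]))

# Two cut-off scores: three categories.
print(classify(65, [60, 80], ["non-master", "partial master", "master"]))
```

Cut-off scores that vary by objective could be handled by keeping a separate `cutoffs` list for each objective.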

It is important at this point to separate three types of
standards or cut-off scores. Consider the following statement:

School district A has set the following target:
It desires to have 85% or more of its students
in the second grade achieve 90% of the reading
objectives at a standard of performance equal
to or better than 80%.

Three types of standards are involved in the example:

1. The 80% standard is used to interpret examinee
performance on each of the objectives measured by a test.

2. The 90% standard is used to interpret examinee
performance across all of the objectives measured by a test.

3. The 85% standard is applied to the performance of second
graders on the set of objectives measured by a test.

In this unit, only the first use of standards or cut-off scores will
be considered.
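
To keep the three standards apart, it may help to see them as three nested checks. The sketch below is illustrative only: the function names and the data layout (one list of percent-correct scores per student, one entry per objective) are assumptions, and integer percentages are used to avoid rounding problems.

```python
def student_meets_target(objective_scores, objective_cutoff=80,
                         objectives_required=90):
    """Standards 1 and 2: has the student reached the 80% cut-off
    on at least 90% of the objectives?"""
    mastered = sum(1 for pct in objective_scores if pct >= objective_cutoff)
    return 100 * mastered >= objectives_required * len(objective_scores)


def district_meets_target(students, students_required=85):
    """Standard 3: have at least 85% of the students met the target?"""
    passing = sum(1 for scores in students if student_meets_target(scores))
    return 100 * passing >= students_required * len(students)


# Hypothetical data: 20 students, 10 objectives each.
meets = [80] * 9 + [50]        # masters 9 of 10 objectives
misses = [80] * 8 + [50, 50]   # masters 8 of 10 objectives
print(district_meets_target([meets] * 17 + [misses] * 3))
```

Only the innermost comparison, the 80% cut-off on a single objective, is the kind of standard treated in this unit.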

In what follows it is important to separate the theoretical
arguments for or against the uses of cut-off scores from the uses and
misuses of cut-off scores in practical settings. For example, it is well-known
that cut-off scores are often "pulled from the air" or set to (say) 80%
because that is the value another school district is using. But
the fact that cut-off scores are being determined in a highly
inappropriate way is obviously not grounds for rejecting the concept of a
"cut-off score." If the concept is appropriate for some particular
use of a criterion-referenced test, the task becomes one of training
people to set and to use cut-off scores properly (Hambleton, 1978).
Four questions with respect to the use of cut-off scores with
criterion-referenced tests require answers:

1. Why are cut-off scores needed?

2. What methods are available for setting cut-off scores?

3. How should a method be selected?

4. What guidelines are available for applying particular
methods successfully?

1. Why are cut-off scores needed?

An answer to the question depends on the intended use (or uses)
of the test score information. Consider first objectives-based or
competency-based programs, since it is with these types of programs that
criterion-referenced tests and cut-off scores are most often used.
Objectives-based programs, in theory, are designed to improve the quality of
instruction by (1) defining the curricula in terms of objectives, (2)
relating instruction and assessment closely to the objectives, (3) making
individualization of instruction possible, and (4) providing
for on-going evaluation. Hard evidence on the success of
objectives-based programs (or most new programs) is in short supply, but there is
some evidence to suggest that when objectives-based programs are
implemented fully and properly they are better than more
"traditionally-oriented" curricula (Klausmeier, Rossmiller, & Saily, 1977; Torshen,
1977). Individualization of instruction is "keyed" to descriptive
information provided by criterion-referenced tests relative to examinee
performance on test items measuring objectives in the curriculum.
But descriptive information such as "examinee A has answered correctly
85% of the test items measuring a particular objective" must be
evaluated and decisions made based upon that interpretation. Has a student
demonstrated a sufficiently high level of performance on an objective
to lead to a prediction that she/he has a good chance of success on
the next objective in a sequence? Does a student's performance level
indicate that he/she may need some remedial work? Is the student's
performance level high enough to meet the target for the objective
defined by teachers of the curriculum? In order to answer these and
many other questions it is necessary to set standards or cut-off scores.
How else can decisions be made? Comparative statements about students
(for example, Student A performed better than 60% of her classmates)
are largely irrelevant. Cut-off scores carefully developed by qualified
teams of experts can contribute substantially to the success of an
objectives-based program (competency-based program or basic skills
program) because cut-off scores provide a basis for effective
decision-making.

There has also been criticism (Glass, 1978a) of the use of
cut-off scores with "life skills" or "survival skills" tests. These are
terms currently popular with State Departments of Education, School
Districts, Test Publishers, and the press. Of course, Glass is correct
when he notes that it would be next to impossible to validate the
classifications of examinees into "mastery states," i.e., those predicted to
be "successful" or "unsuccessful" in life. On the other hand, if what
is really meant by the term "life skills" (say) is "graduation
requirements," then standards of performance for "basic skills" or "high school
competency" tests can probably be set by appropriately chosen groups of
individuals (Millman, personal communication).

2. If cut-off scores are needed, what methods are available for setting them?

Numerous researchers have catalogued many of the available methods
(Hambleton & Eignor, 1979; Hambleton et al., 1978; Jaeger, 1976;
Millman, 1973; Meskauskas, 1976; Shepard, 1976). Many of these methods
have also been reviewed by Glass (1978a). It suffices to say here
that there exist methods based on a consideration of (1) item content,
(2) guessing and item sampling, (3) empirical data from mastery and
non-mastery groups, (4) decision-theoretic procedures, (5) external
criterion measures, and (6) educational consequences. These methods
will be considered in detail in sections 6.6, 6.7, and 6.8.

What is clear is that all of the methods are arbitrary and 
this point has been made or implied by everyone whose work we have 
had an opportunity to read. The point is not disputed by anyone we 
are aware of. But as Glass (1978a) notes, "arbitrariness is no bogey- 
man, and one ought not to shrink from a necessary task because it 
involves arbitrary decisions" (p. 42). Popham (1978) has given an 
excellent answer to the concern expressed by some researchers about 
arbitrary standards: 



Unable to avoid reliance on human judgment as
the chief ingredient in standard-setting, some
individuals have thrown up their hands in dismay
and cast aside all efforts to set performance
standards as arbitrary, hence unacceptable.

But Webster's Dictionary offers us two
definitions of arbitrary. The first of these is
positive, describing arbitrary as an adjective
reflecting choice or discretion, that is,
"determinable by a judge or tribunal." The second
definition, pejorative in nature, describes
arbitrary as an adjective denoting capriciousness,
that is, "selected at random and without reason."
In my estimate, when people start knocking the
standard-setting game as arbitrary, they are
clearly employing Webster's second, negatively loaded
definition.

But the first definition is more accurately
reflective of serious standard-setting efforts.
They represent genuine attempts to do a good job
in deciding what kinds of standards we ought to
employ. That they are judgmental is inescapable.
But to malign all judgmental operations as
capricious is absurd. (p. 168)

And, in fact, much of what we do is arbitrary in the positive sense of the
word. We set fire standards, health standards, environmental standards,
highway safety standards (even standards for the operation of nuclear reactors),
and so on. And in educational settings, it is clear that teachers make
arbitrary decisions about what to teach in their courses, how to teach
their material, and at what pace they should teach. Surely, if teachers
are deemed qualified to make these other important decisions, they are
equally qualified to set standards or cut-off scores for the monitoring
of student progress in their courses. But what if a cut-off score is
set too high (or low) or students are misclassified? Through experience
with a curriculum, with high quality criterion-referenced tests, and
with careful evaluation work, standards that are not "in line" with
others can be identified and revised. And for students who are
misclassified there are some redeeming features. Those who perform below the
standard will be assigned remedial work, and the fact that they performed below
the cut-off score suggests that they could not be too far above it (this
would be true for most of the students about whom false-negative errors
are made), and so the review period will not be a total waste of time.
And those students who are misclassified because they scored above
a cut-off score will be tested again. It is possible that the next
time the error will be caught (particularly if the objectives are
sequential). A comment by Ebel (1978) is particularly appropriate
at this point:



Pass-fail decisions on a person's achievement
in learning trouble some measurement specialists
a great deal. They know about errors of
measurement. They know that some who barely pass do so
only with the help of errors of measurement. They
know that some who fail do so only with the hindrance
of errors of measurement. For these, passing or
failing does not depend on achievement at all. It
depends only on luck. That seems unfair, and indeed
it is. But, as any measurement specialist can explain,
it is also entirely unavoidable. Make a better test
and we reduce the number who will be passed or failed
by error. But the number can never be reduced
to zero. (p. 549)



The consequences of false-positive and false-negative errors
with basic skills assessment or high school certification tests are,
however, considerably more serious, and so more attention must be given
to the design of these testing programs (for example, content covered,
the timing of tests, and decisions made with the test
results). Considerably more effort must also be given to test
development, content validation, and setting of standards.
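
Ebel's point about errors of measurement can be made concrete with a small calculation. The sketch below rests on an assumption that is not part of the discussion above: each item is treated as an independent trial that the examinee answers correctly with probability equal to his or her true domain score (a simple binomial model). The function name `pass_probability` and all numeric values are hypothetical.

```python
from math import comb


def pass_probability(true_score, n_items, cutoff_items):
    """Probability of answering at least cutoff_items of n_items
    correctly, under a binomial model with success probability
    true_score on every item."""
    return sum(
        comb(n_items, k) * true_score**k * (1 - true_score)**(n_items - k)
        for k in range(cutoff_items, n_items + 1)
    )


# A "true master" (domain score .85) facing an 80% cut-off.
for n_items in (10, 100):
    cutoff_items = n_items * 80 // 100
    false_negative = 1 - pass_probability(0.85, n_items, cutoff_items)
    print(n_items, round(false_negative, 3))

# A "true non-master" (domain score .70) facing the same cut-off.
for n_items in (10, 100):
    cutoff_items = n_items * 80 // 100
    false_positive = pass_probability(0.70, n_items, cutoff_items)
    print(n_items, round(false_positive, 3))
```

Under this model both error rates shrink as the test is lengthened, but neither reaches zero, which is the substance of Ebel's remark.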



3. How should a method be selected?

There are many factors to consider in selecting a method to
determine cut-off scores. For example:

1. How important are the decisions?

2. How much time is available?

3. What resources are available to do the job?

4. How capable are the appropriate individuals of applying
a particular method successfully?

The most interesting work we have seen to date regarding the
selection of a method was offered by Jaeger (1976). He considers
several methods for determining cut-off scores, several approaches
for assigning examinees to mastery states, and various threats to the
validity of assignments. While Jaeger's work is theoretical, it provides
an excellent starting point for anyone interested in initiating research
on the merits of different methods. One thing seems clear from his
work: all of the methods he studied appear to have numerous potential drawbacks,
and so the selection of a method in a given situation should be made carefully.

4. What guidelines are available for applying particular methods successfully?

Unfortunately, there are relatively few sets of guidelines
available for applying any of the methods. In our judgment, Zieky and
Livingston (1977) have provided a very helpful set of guidelines for
applying several methods (the popular Nedelsky method and the Angoff
method are two of the methods included). Some new work by Popham (1978)
is also very helpful. More materials of this type and quality are
needed. Some procedural steps for standard-setting with respect to three
important uses of tests, (1) daily classroom assessment, (2) basic skills
assessment for yearly promotion and high school certification, and (3)
professional licensing and certification, are provided in section 6.9.
6.3 Distinction Between Continuum and State Models

The basic difference between continuum and state models has to do
with the underlying assumption made about ability. According to Meskauskas,
two characteristics of continuum models are:

1. Mastery is viewed as a continuously distributed ability or set
of abilities.

2. An area is identified at the upper end of this continuum, and
if an individual equals or exceeds the lower bound of this
area, he/she is termed a master.

State models, rather than being based on a continuum of mastery, view
mastery as an all-or-none proposition (i.e., either you can do
something or you cannot). Three characteristics of state models are:

1. Test true-score performance is viewed as an all-or-nothing
state.

2. The standard is set at 100%.

3. After a consideration of measurement errors, standards are
often set at values less than 100%.

There are at least three methods for setting standards that are
built on a state model conceptualization of mastery. The models take
into account measurement error, deficiencies of the examination, etc.,
in "tempering" the standard from 100%. These methods have been referred
to by Glass (1978a) in his review of methods for setting standards as
"counting backwards from 100%." State model methods advanced to date
include the mastery testing evaluation model of Emrick (1971), the
true-score model of Roudabush (1974), and some recently advanced
statistical models of Macready and Dayton (1977). However, since state
models are somewhat less useful than continuum models in elementary
and secondary school testing programs, they will not
be considered further here. Our failure to consider them
further, however, should not be interpreted as a criticism
of this general approach to standard-setting. The approach seems to
be especially applicable with many performance tests (Hambleton & Simon,
in preparation).
6.4 Traditional and Normative Procedures

Before discussing the various continuum models of standard
setting, two other models for standard-setting should be mentioned.
These methods, which seem to have limited value in setting
standards, have been referred to by a variety of names.
We will call them "traditional standards" and "normative standards."

Traditional standards are standards that have gained acceptance
because of their frequent use. Classroom examples include the decision
that 90 to 100 percent is an A, 80 to 89 percent is a B, etc. It appears
that such methods have been used occasionally in setting standards.

"Normative" standards refer to any of three different uses of
normative data, two of which are, at best, questionable. In the first
method, use is made of the normative performance of some external
"criterion" group. As an example, Jaeger (1978) cites the use of the
Adult Performance Level (APL) tests by Palm Beach County, Florida schools.
Test performance of groups of "successful" adults was used to set
standards for high school students. Such a procedure can be
criticized on a number of grounds. Jaeger (1978) points out that
society changes, and that standards should also change. Standards
based on adult performance may not be relevant to high school students.
Shepard (1976) points out that any normatively-determined standard will
immediately result in a multitude of counterexamples. Further, Burton
(1978) suggests that relationships between skills in school subjects
and later success in life are not readily determinable; hence, observing
the degree of achievement on the test of some "successful" norm group
makes little sense. Jaeger (1978) goes on to say: "There
are no empirically tenable survival standards on school-based skills
that can be justified through external means."

A second way of proceeding with normative data is to make a
decision about a standard based solely on the distribution of scores
of examinees who take the test. Such a procedure circumvents the
"minimum test score for success in life" problem, but the procedure
is still not useful for setting standards. For example, Glass (1978a)
cites the California High School Proficiency Examination, where the 50th
percentile of graduating seniors served as the standard. What can
be said of a procedure where whether an individual passes or
fails a minimum competency test depends upon the other individuals
taking the test? In the California situation, the standard was set
with no reference at all to the content of the test or the difficulty
of the test items.

The third use of normative data discussed in the literature
concerns the supplemental use of normative data in setting a standard.
Shepard (1976), Jaeger (1978), and Conaway (1976, 1977) all favor such
a procedure. Recently Jaeger (1978) advanced a standard setting method which
requires judges to make judgments partially on the basis of item content.
In his method, Jaeger calls for incorporation of some tryout test data
to aid judges in reconsidering their initial assessments. Shepard
(1976) makes the following point:

Expert judges ought to be provided with normative
data in their deliberations. Instead of relying
on their experience, which may have been with
unusual students or professionals, experts ought to
have access to representative norms. . . . Of course,
the norms are not automatically the standards.
Experts still have to decide what "ought" to be, but
they can establish more reasonable expectations
if they know what current performance is than if
they deliberate in a vacuum.

We agree with Jaeger, Conaway, and Shepard about the usefulness
of normative data when used in conjunction with a standard setting
method.
6.5 Consideration of Several Promising Standard Setting Methods

The remaining methods for setting standards to be discussed in this unit
assume that domain score estimates derived from criterion-referenced tests
are on a continuous scale (hence, the methods fall under the heading of
"Continuum Models"). For convenience, the methods under discussion are
organized into three categories. The methods are presented in Figure
6.5.1. The categories are labelled "judgmental," "empirical," and
"combination." In judgmental methods, data are collected from judges for
setting standards, or judgments are made about the presence of variables
(for example, guessing) that would affect the placement of a standard.
Empirical methods require the collection of examinee response data to aid
in the standard-setting process. Combination methods, not surprisingly,
incorporate judgmental data and empirical data into the standard-setting
process.



Figure 6.5.1

A classification of methods for setting standards [1]

Judgmental Methods

   Item Content
      Nedelsky (1954)
      Modified Nedelsky (Nassif, 1978)
      Angoff (1971)
      Modified Angoff (ETS, 1976)
      Ebel (1972)
      Jaeger (1978)

   Guessing
      Millman (1973)

Combination Methods

   Judgmental-Empirical
      Contrasting Groups (Zieky and Livingston, 1977)
      Borderline Groups (Zieky and Livingston, 1977)

   Educational Consequences
      Block (1972)

   Bayesian Methods
      Hambleton and Novick (1973)
      Novick, Lewis, Jackson (1973)
      Schoon, Gullion, Ferrara (1978)

Empirical Methods [2]

   Data: Two Groups
      Berk (1976)

   Data: Criterion Measure
      Livingston (1975)
      Livingston (1976)
      Huynh (1976)
      Van der Linden and Mellenbergh (1977)

   Decision-Theoretic
      Kriewall (1972)

[1] From a paper by Hambleton and Eignor (1979).
[2] Involve the use of examinee response data.
6.6 Judgmental Methods

6.6.1 Item Content

In this situation, individual items are inspected, with the level
of concern being how the minimally competent person would perform on
the items. In other words, a judge is asked to assess how or to what
degree an individual who could be described as minimally competent would
perform on each item. It should be noted before describing particular
procedures utilizing this criterion that while this is a good deal more
objective than setting standards based on any of the methods previously
discussed, a considerable degree of subjectivity still exists. Six
procedures based on item content assessment will now be discussed.

i. Nedelsky Method

In Nedelsky's method, judges are asked to view each question in a
test with a particular criterion in mind. The criterion for each question
is: which of the response options should the minimally competent student
(Nedelsky calls them "D-F students") be able to eliminate as incorrect?
The minimum passing level (MPL) for that question then becomes the
reciprocal of the number of remaining alternatives. For instance, if on a
five-alternative multiple choice question, a judge feels that a minimally
competent person could eliminate two of the options, then for that
question, MPL = 1/3. The
judges proceed with each question in a like fashion, and upon completion
of the judging process, sum the values for each question to obtain a
standard on the total set of test items. Next, the individual judges'
standards are averaged. The average is denoted $\pi_0$.

Nedelsky felt that if one were to compute the standard deviation of
the individual judges' standards, this distribution would be synonymous with
the (hypothesized or theoretical) distribution of the scores of the
borderline students. This standard deviation, $\sigma$, could then be multiplied by a
constant K, decided upon by the test users, to regulate how many (as a
percent) of the borderline students pass or fail. The final formula
then becomes:

$$\text{adjusted standard} = \pi_0 + K\sigma$$

How does the K term work? Assuming an underlying normal distribution,
if one sets K=1, then 84% of the borderline examinees will fail.
If K=2, then 98% of these examinees will fail. If K=0, then 50% of the
examinees on the borderline should fail. The value for K is set by (say)
a committee prior to the examination.
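
The percentages quoted for K follow directly from the standard normal distribution: the share of borderline examinees expected to fail is the area below K. A short check, using Python's standard library (the function name `borderline_fail_share` is ours, introduced only for illustration):

```python
from statistics import NormalDist


def borderline_fail_share(k):
    """Share of borderline examinees expected to score below a
    standard placed k standard deviations above the judges' mean,
    assuming borderline scores are normally distributed."""
    return NormalDist().cdf(k)


for k in (0, 1, 2):
    print(k, round(100 * borderline_fail_share(k)))
```

Negative values of K would move the standard below the judges' average and let more than half of the borderline examinees pass.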

The final result of the application of Nedelsky's method will be
an absolute standard. This is because the standard is arrived at without
consideration of the score distributions of any reference group. In fact,
the standard is arrived at prior to using the test with the group one is
concerned with testing.

The following example is included to demonstrate how the Nedelsky
method can be applied in a criterion-referenced testing situation.

Example: Suppose five judges were asked to score, using the Nedelsky
method, a six question criterion-referenced test made up of questions
that have five response options each. Further, suppose the judges agreed
that they would like 84% of the "D-F" or minimally competent students to
fail (i.e., they set K=+1). The calculations below show the steps
necessary to calculate a cut-off score for the test.
                     Test Question                  Cut-off Score
Judge     1      2      3      4      5      6      from Each Judge

  A      .25    .33    .25    .25    .00    .33          1.41
  B      .25    .50    .25    .50    .25    .33          2.08
  C      .33    .33    .25    .33    .25    .33          1.82
  D      .25    .33    .25    .33    .25    .33          1.74
  E      .00    .50    .25    .33    .00    .25          1.33

Average Cut-Off Score (Across Five Judges):

$$\frac{1.41 + 2.08 + 1.82 + 1.74 + 1.33}{5} = 1.68$$

Standard Deviation of the Cut-Off Scores:

$$\sqrt{\frac{(1.41-1.68)^2 + (2.08-1.68)^2 + \cdots + (1.33-1.68)^2}{5}} = \sqrt{\frac{.380}{5}} = .28$$

Adjusted Cut-Off Score (84% of Borderline Students to Fail):

$$1.68 + 1 \times .28 = 1.96$$

Therefore, approximately two test items out of six is the cut-off
score on this test. From a practical standpoint, this value would seem
low, but the data were created to demonstrate the process and not to model
a real testing situation. Therefore, no practical significance should be
attached to the answer.
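
The arithmetic of the example can be reproduced with a short script. This is a sketch rather than a prescribed procedure: the function name `nedelsky_cutoff` is ours, and the use of the population standard deviation (dividing by the number of judges) is an inference from the worked figures, since that choice reproduces the .28 shown above.

```python
from statistics import mean, pstdev

# Minimum passing levels from the example: one row per judge,
# one value per test question.
ratings = {
    "A": [.25, .33, .25, .25, .00, .33],
    "B": [.25, .50, .25, .50, .25, .33],
    "C": [.33, .33, .25, .33, .25, .33],
    "D": [.25, .33, .25, .33, .25, .33],
    "E": [.00, .50, .25, .33, .00, .25],
}


def nedelsky_cutoff(ratings, k):
    """Return (average standard, standard deviation, adjusted standard)."""
    judge_standards = [sum(mpls) for mpls in ratings.values()]
    average = mean(judge_standards)
    sd = pstdev(judge_standards)  # population SD, dividing by number of judges
    return average, sd, average + k * sd


average, sd, adjusted = nedelsky_cutoff(ratings, k=1)
print(round(average, 2), round(sd, 2), round(adjusted, 2))
```

Because the script carries unrounded values through to the last step, its adjusted standard can differ in the second decimal place from the hand calculation, which rounds the average and standard deviation first.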
ii. Modified Nedelsky

Nassif (1978), in setting standards for the competency-based teacher
education and licensing systems in Georgia, utilized a modified Nedelsky
procedure. A modification of the Nedelsky method was needed to handle
the volume of items in the program. In the modified Nedelsky task, the
entire item (rather than each distractor) is examined and classified in
terms of two levels of examinee competence. The following question was
asked about each item: "Should a person with minimum competence in the
teaching field be able to answer this item correctly?" Possible answers
were "yes," "no," and "I don't know." Agreement among judges can be
studied through a simple comparison of the ratings judges give to each
item. A standard may be obtained by computing the average number of "yes"
responses judges give to the entire set of test items.
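
A minimal sketch of the two computations just mentioned follows. The data are invented, the function names are ours, and the code "dk" standing for "I don't know" is a hypothetical convention; nothing here is taken from Nassif's materials.

```python
def modified_nedelsky_standard(ratings):
    """Average number of "yes" ratings per judge across all items."""
    yes_counts = [answers.count("yes") for answers in ratings.values()]
    return sum(yes_counts) / len(yes_counts)


def unanimous_items(ratings):
    """Item numbers (starting at 1) on which every judge gave the same rating."""
    per_item = zip(*ratings.values())
    return [number for number, answers in enumerate(per_item, start=1)
            if len(set(answers)) == 1]


# Hypothetical ratings: three judges, five items.
ratings_by_judge = {
    "Judge 1": ["yes", "yes", "no", "yes", "dk"],
    "Judge 2": ["yes", "no", "no", "yes", "no"],
    "Judge 3": ["yes", "yes", "no", "no", "yes"],
}
print(modified_nedelsky_standard(ratings_by_judge))
print(unanimous_items(ratings_by_judge))
```

The standard comes out in raw-score units (number of items), so it would be divided by the number of items if a percentage standard is wanted.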
iii. Ebel's Method

Ebel (1972) goes about arriving at a standard in a
somewhat different manner, but his procedure is also based upon the test
questions rather than an "outside" distribution of scores. Judges are asked
to rate items along two dimensions: relevance and difficulty. Ebel uses four
categories of relevance: essential, important, acceptable, and questionable. He
uses three difficulty levels: easy, medium, and hard. These categories then form
(in this case) a 3 x 4 grid. The judges are next asked to do two things:

1. Locate each of the test questions in the proper cell, based upon
relevance and difficulty.

2. Assign a percentage to each cell, that percentage being the percentage
of items in the cell that the minimally-qualified examinee should be
able to answer.

Then the number of questions in each cell is multiplied by the appropriate
percentage (agreed upon by the judges), and the sum of all the cells, when
divided by the total number of questions, yields the standard.

The example that follows is modeled after an example offered by 

Ebel (1972). 

Suppose that for a 100 item test, five judges came to the 
following agreement on percentage of success for the minimally qualified candidate. 



                        Difficulty Level
Relevance         Easy      Medium    Hard

Essential         100%*     80%
Important          90%      70%
Acceptable         90%      40%       30%
Questionable       70%      50%       20%

*The expected percentage of passing for items in the category.
Combining this data with the judges' location of test questions in
the particular cells would yield a table like the following:

Item                 Number of    Expected    Number x
Category             Items**      Success     Success

ESSENTIAL
  Easy                   85         100         8500
  Medium                 55          80         4400

IMPORTANT
  Easy                  123          90        11070
  Medium                103          70         7210

ACCEPTABLE
  Easy                   21          90         1890
  Medium                 43          40         1720
  Hard                   50          30         1500

QUESTIONABLE
  Easy                    2          70          140
  Medium                  8          50          400
  Hard                   10          20          200

TOTAL                   500                    37030

Standard = 37030 / 500 = 74

**The number of items placed in each category by all five of the judges.
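
The Ebel computation for this example can be reproduced with a few lines
of code. This is a sketch only: the dictionary layout and the function
name `ebel_standard` are illustrative assumptions, while the cell counts
and percentages are the ones in the example.

```python
# (relevance, difficulty): (number of item placements, expected % success)
cells = {
    ("essential", "easy"): (85, 100),
    ("essential", "medium"): (55, 80),
    ("important", "easy"): (123, 90),
    ("important", "medium"): (103, 70),
    ("acceptable", "easy"): (21, 90),
    ("acceptable", "medium"): (43, 40),
    ("acceptable", "hard"): (50, 30),
    ("questionable", "easy"): (2, 70),
    ("questionable", "medium"): (8, 50),
    ("questionable", "hard"): (10, 20),
}


def ebel_standard(cells):
    """Return (total placements, sum of number x success, standard in %)."""
    total_items = sum(n for n, _ in cells.values())
    weighted_sum = sum(n * pct for n, pct in cells.values())
    return total_items, weighted_sum, weighted_sum / total_items


total_items, weighted_sum, standard = ebel_standard(cells)
print(f"{weighted_sum} / {total_items} = {standard:.2f}%")
```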
Three comments can be made about Ebel's method that should be sufficient
to suggest caution when using it. One, Ebel offers no prescription for the
number or type of descriptions to be used along the two dimensions. This
is left to the judgment of the individuals judging the items. It is
likely that a different set of descriptions applied to the same test
would yield a different standard. Two, the process is based upon the
decisions of judges, and while the standard could be called absolute, in
that it is not referenced to a score distribution, it can't be called an
"objective" standard. Three, a point about Ebel's method has been offered
by Meskauskas (1976):

In Ebel's method, the judge must simulate the decision
process of the examinee to obtain an accurate judgment
and thus set an appropriate standard. Since the judge
is more knowledgeable than the minimally-qualified
individual, and since he is not forced to make a decision
about each of the alternatives, it seems likely that the
judge would tend to systematically over-simplify the
examinee's task . . . Even if this occurs only occasionally,
it appears likely that, in contrast to the Nedelsky method,
the Ebel method would allow the raters to ignore some of
the finer discriminations that an examinee needs to make
and would result in a standard that is more difficult to
reach, (p. 13iO 

iv. Angoff's Method

When using Angoff's technique, judges are asked to assign a probability
to each test item directly, thus circumventing the analysis of a grid or
the analysis of response alternatives. Angoff (1971) states:

. . .ask each judge to state the probability that the
minimally acceptable person would answer each item
correctly. In effect, the judges would think of a
number of minimally acceptable persons instead of only
one such person, and would estimate the proportion of
minimally acceptable persons who would answer each item
correctly. The sum of these probabilities, or
proportions, would then represent the minimally acceptable
score. (p. 515)
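
Angoff's rule lends itself to a very small computation. The sketch below
uses hypothetical probabilities for three judges and four items. Summing
each judge's probabilities follows the quotation; averaging those sums
across judges to obtain a single panel standard is an assumption added
here for illustration, as are the function names.

```python
from statistics import fmean

# Hypothetical data: one list per judge, one probability per test item.
judge_probabilities = [
    [0.50, 0.60, 0.90, 0.75],
    [0.40, 0.70, 0.80, 0.70],
    [0.60, 0.65, 0.85, 0.80],
]


def angoff_cutoffs(probabilities_by_judge):
    """Each judge's minimally acceptable score: the sum of item probabilities."""
    return [sum(probs) for probs in probabilities_by_judge]


def angoff_standard(probabilities_by_judge):
    """Panel standard, assumed here to be the mean of the judges' sums."""
    return fmean(angoff_cutoffs(probabilities_by_judge))


cutoffs = angoff_cutoffs(judge_probabilities)
panel_standard = angoff_standard(judge_probabilities)
print(f"judge cut-offs = {cutoffs}, panel standard = {panel_standard:.2f}")
```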



v. Modified Angoff

ETS (1976) utilized a modification of Angoff's method for setting
standards. Based on the rationale that the task of assigning
probabilities may be overly difficult for the items to be assessed
(National Teacher Exams), Educational Testing Service instead supplied a
seven point scale on which certain percentages were fixed. Judges were
asked to estimate the percentage of minimally knowledgeable examinees
who would know the answer to each test item. The following scale was
offered:

5 20 40 60 75 90 95 DNK

where "DNK" stands for "Do Not Know."

ETS has also used scales with the fixed points at somewhat different
values; the scales are consistent, though, in that seven choice points
are given. For the Insurance Licensing Exams, 60 was used as the center
point, since the average percent correct on past exams centered around
60%. The other options were then spaced on either side of 60.



vi. Jaeger's Method 

Jaeger (1978) recently presented a method for standard-setting on the
North Carolina High School Competency Test. Jaeger's method incorporates
a number of suggestions made by participants in a 1976 NCME annual meeting
symposium presented in San Francisco by Stoker, Jaeger, Shepard, Conaway,
and Haladyna; it is iterative, uses judges from a variety of backgrounds,
and employs normative data. Further, rather than asking a question
involving "minimal competence," a term which is hard to operationalize
and conceptualize, Jaeger's questions are instead:

"Should every high school graduate be able to answer
this item correctly?" ___ Yes, ___ No.

and

"If a student does not answer this item correctly,
should he/she be denied a high school diploma?"
___ Yes, ___ No.

After a series of iterative processes involving judges from various areas
of expertise, and after the presentation of some normative data,
standards determined by all groups of judges of the same type are
pooled, and a median is computed for each type of judge. The minimum
median across all groups is selected as the standard.
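
The final pooling step of Jaeger's method can be sketched as follows. The
judge types and the standards they recommend are hypothetical, and the
function name `jaeger_standard` is illustrative; only the rule itself
(a median for each type of judge, then the minimum of those medians) is
taken from the description above.

```python
from statistics import median

# Hypothetical pooled standards (number of items) by type of judge.
standards_by_judge_type = {
    "teachers": [30, 32, 35],
    "parents": [27, 29, 33, 36],
    "employers": [34, 36, 38],
}


def jaeger_standard(pooled):
    """Return (medians by judge type, type with lowest median, standard)."""
    medians = {judge_type: median(values)
               for judge_type, values in pooled.items()}
    lowest_type = min(medians, key=medians.get)
    return medians, lowest_type, medians[lowest_type]


medians, lowest_type, standard = jaeger_standard(standards_by_judge_type)
print(f"medians = {medians}; standard = {standard} (from {lowest_type})")
```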

Comparisons Among Judgmental Models 

We are aware of two studies that compare judgmental methods of
setting standards; one study was done in 1976, and the other is
presently underway at ETS.

In 1976, Andrew and Hecht carried out an empirical comparison of the
Nedelsky and Ebel methods. In that study, judges met on two separate
occasions to set standards for a 180-item exam, with four options per
item, used to certify professional workers. On one occasion the Nedelsky
method was used; on the second occasion the Ebel method was used. The
percentage of test items that should be answered correctly by a minimally
competent examinee was set at 69% by the Ebel method and at 46% by the
Nedelsky method.

Glass (1978a) described the observed difference as a "startling finding."
Our view is that since the directions to the judges were different, and
the procedures differed, we would not expect the results from these two
methods to be similar. The authors themselves report:



It is perhaps not surprising that two procedures 
which involve different approaches to the eval- 
uation of test items would result in different 
examination standards. Such examination standards 
will always be subjective to some extent and will 
involve different philosophical assumptions and 
varying conceptualizations. (p. 49) 



Possibly the most important result of the Andrew-Hecht study was the
high level of agreement in the determination of a standard using the
same method across two teams of judges. The difference was not more than
3.4% within each method. Data of this kind address a concern raised by
Glass (1978a) about whether judges can make determinations of standards
consistently and reliably. At least in this one study, it appears that
they could. From our interactions with staff at ETS who conduct teacher
workshops on setting standards, we have learned that teams of teachers
working with a common method obtain results that are quite similar, and
this result holds across tests in different subject matter areas and at
different grade levels. We have observed the same result in our own
work. Of course, certain conditions must be established if agreement
among judges is to be obtained. Essentially, it is necessary that the
judges share a common definition of the "minimally competent" student and
fully understand the rating process they are to use.



Ebel (1972) makes a similar point: 



. . .it is clear that a variety of approaches can 
be used to solve the problem of defining the pass- 
ing score. Unfortunately, different approaches are 
likely to give different results. (p. 496) 
6.6.2 Guessing and Item Sampling

In this section, some concerns initially expressed by Millman (1973)
about errors due to guessing and item sampling will be discussed.

If the test items allow a student to answer questions correctly by
guessing, a systematic error is introduced into student domain score
estimates. There are three possible ways to rectify this situation:

1. The cut-off score can be raised to take into account the
contribution expected from the guessing process.

2. A student's score can be corrected for guessing and then the
adjusted score compared to the performance standard.

3. The test itself can be constructed to minimize the guessing process.

Methods one and two assume that guessing is of a pure, random nature,
which is not likely to be the case for criterion-referenced tests. Thus,
adjusting either the cutting score or the student's scores will probably
prove to be inadequate. The test must be structured to keep guessing to a
minimum, because if it occurs, it can't be adequately corrected for.

Also, if because of problems of test construction, inconvenience of
administration, or a host of other problems, the test is not representative
of the content of the domain, then Millman (1973) suggests that the cutting
score or standard be raised (or lowered) by an amount sufficient to protect
against misclassification of students; i.e., false-positive and
false-negative errors. Millman offers no methods for determining the extent
or direction of correction for these problems. We feel that the test
practitioner should exert extra effort to assure that the problem just
discussed doesn't occur in the first place. Once again, there doesn't
appear to be an adequate method for "correcting away" the problem.

6.7 Empirical Methods

6.7.1 Data From Two Groups

Berk (1976) presented a method for setting cut-off scores that is
based on empirical data. He selects empirically the optimal cutting
score for a test based upon test data from two samples of students, one
of which has been instructed on the material, and the other uninstructed.
Before discussing his methodology, where he offers three ways of proceeding
based upon the data collected, it is worth discussing why he chose to
formulate his model in the first place. He suggests that the extant
approaches of a nature similar to his, namely those based on the binomial
distribution and those based upon Bayesian decision theory, suffer from
a deficiency. According to Berk:



The fundamental deficiency of all of these methods
is their failure to define mastery operationally
in terms of observed student performance, the
objective or trait being measured, and item and
test characteristics. The criterion level or
cutting score is generally set subjectively on
the basis of "judgment" or "experience" and the
probabilities of Type I/Type II classification
errors associated with the criterion are estimated.



One of Berk's procedures also considers false-positive and false-negative
errors, but the difference is that his results are based upon actual data.

Berk offers three ways of approaching the problem of setting standards 
utilizing empirical data: (1) Classification of outcome probabilities, 
(2) computation of a validity coefficient, and (3) utility analysis. 



i. The Basic Situation

Two criterion groups are selected for use in this procedure, one group
comprised of instructed students and another of uninstructed students.
The instructed group should, according to Berk, "consist of those students
who have received 'effective' instruction on the objective to be assessed."
Berk suggests that these groups should be approximately equal in size and
large enough to produce stable estimates of probabilities. Test items
measuring one objective are then administered to both groups and the
distribution of scores (putting both groups together) can be divided by a
cut-off score into two categories.

Combining the classifications of students by predictor (test score) and
criterion (instructed vs. uninstructed status) results in four categories
that can be represented in a 2 x 2 table, with relevant marginals:

1. True Master (TM): an instructed student whose test score lies above
the cutting score (C).

2. False Master (FM): a Type II misclassification error, where an
uninstructed student's test score lies above the cutting score (C).

3. True Non-Master (TN): an uninstructed student whose test score lies
below the cutting score (C).

4. False Non-Master (FN): a Type I misclassification error, where an
instructed student's test score lies below the cutting score (C).

In tabular form, this can be presented as follows. Note how the marginals
are defined, because they are used in the formulations to follow.



                                  CRITERION MEASURE

PREDICTOR                 Instructed       Uninstructed
(TEST SCORE)                 (I)               (U)

Predicted Masters            (TM)          Type II (FM)      PM = TM + FM

Predicted Non-Masters    Type I (FN)           (TN)          PN = FN + TN

                           Masters          Non-Masters
                         M = TM + FN        N = FM + TN
ii. Classification of Outcome Probabilities

In this procedure, identification of the optimal cutting score involves
an analysis of the two-way classification of outcome probabilities shown
above. This can be done algebraically by following the steps listed below,
or graphically, as illustrated in a subsequent section. The steps to
follow are:

1. Set up a two-way classification of the frequency distribution for each
possible cutting score.

2. Compute the probabilities of the 4 outcomes (for each cutting score)
by expressing the cell frequencies as proportions of the total sample.
For instance:

Prob (TM) = TM/(M+N)
Prob (FM) = FM/(M+N)
Prob (TN) = TN/(M+N)
Prob (FN) = FN/(M+N)

3. For each cutting score, add the probability of correct decisions,
Prob (TM) + Prob (TN), and the probability of incorrect decisions,
Prob (FN) + Prob (FM).

4. The optimal cutting score is the score that maximizes Prob (TM) +
Prob (TN) and minimizes Prob (FN) + Prob (FM). It is sufficient to
observe the score that maximizes Prob (TM) + Prob (TN) because [Prob
(FN) + Prob (FM)] = 1 - [Prob (TM) + Prob (TN)]. That is, the score
that maximizes the probability of correct decisions automatically
minimizes the probability of incorrect decisions.
iii. Graphical Solution

Berk (1976) also mentions that the optimal cutting point for a
criterion-referenced test can be located by observing the frequency
distributions for the instructed and uninstructed groups. According to
Berk:

The instructed and uninstructed group score
distributions are the primary determinants
of the extent to which a test can accurately
classify students as true masters and true
non-masters of an objective. The degree of
accuracy is, for the most part, a function of
the amount of overlap between the distributions.

If the test distributions overlap completely, no decisions can be made.
The ideal situation would be one in which the two distributions have no
overlap at all. A typical situation we should hope for is for the
instructed group distribution to have a negative skew, the uninstructed
group to have a positive skew, and for there to be a minimum of overlap.
The point at which the distributions intersect is then the optimal cut-off
score.

In Figure 6.7.1, the distributions of test scores for two groups
of examinees (one instructed group and one uninstructed group) are
shown.
Figure 6.7.1 Frequency polygons of criterion-referenced test
scores for two groups, an instructed group and an
uninstructed group, on the content measured by the test.

[Figure: frequency polygons plotted against Test Score for the
Uninstructed Group (N=70) and the Instructed Group (N=80). Four types of
examinees are marked on the plot:
A: Non-Masters, Correctly Classified
B: Masters, Incorrectly Classified
C: Masters, Correctly Classified
D: Non-Masters, Incorrectly Classified]

Frequency Distribution of Test Scores

Test Score     U Group     I Group

    8             0           7
    7             2          10
    6             5          18
    5             8          20
    4            11          15
    3            18           5
    2            13           3
    1             9           2
    0             4           0
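
Applying the classification-of-outcome-probabilities steps to this
frequency distribution can be sketched in code. The function names are
illustrative, and one convention is an assumption: an examinee is treated
as a predicted master when the test score is at or above the cutting
score being tried. Under that convention the search points to a cutting
score of 4, which is also where the two frequency distributions cross.

```python
# Test score -> (uninstructed frequency, instructed frequency), from the table.
frequencies = {
    8: (0, 7), 7: (2, 10), 6: (5, 18), 5: (8, 20), 4: (11, 15),
    3: (18, 5), 2: (13, 3), 1: (9, 2), 0: (4, 0),
}


def outcome_probabilities(freq, cut):
    """Return Prob(TM), Prob(FM), Prob(TN), Prob(FN) for one cutting score.

    Assumption: score >= cut means "predicted master".
    """
    tm = sum(i for s, (u, i) in freq.items() if s >= cut)
    fm = sum(u for s, (u, i) in freq.items() if s >= cut)
    tn = sum(u for s, (u, i) in freq.items() if s < cut)
    fn = sum(i for s, (u, i) in freq.items() if s < cut)
    total = tm + fm + tn + fn  # M + N
    return {"TM": tm / total, "FM": fm / total,
            "TN": tn / total, "FN": fn / total}


def prob_correct(freq, cut):
    """Prob(TM) + Prob(TN): the probability of a correct decision."""
    p = outcome_probabilities(freq, cut)
    return p["TM"] + p["TN"]


best_cut = max(sorted(frequencies), key=lambda c: prob_correct(frequencies, c))
best_probs = outcome_probabilities(frequencies, best_cut)
print(best_cut, round(prob_correct(frequencies, best_cut), 2))
```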
iv. Computation of a Validity Coefficient

In this procedure, a validity coefficient is computed for each possible
cutting score. The cutting score yielding the highest validity coefficient
also yields the highest probability of correct decisions. To utilize the
procedure, the following steps should be followed:

1. From the two-way classification introduced earlier, compute the
base rate (BR) and the selection ratio (SR). They are given by:

BR = Prob (FN) + Prob (TM)
SR = Prob (TM) + Prob (FM)

2. Calculate the phi coefficient $\phi_{vc}$ using the following formula:

$$\phi_{vc} = \frac{\text{Prob (TM)} - BR(SR)}{\sqrt{BR(1 - BR)\,SR(1 - SR)}}$$

3. The cutting score yielding the highest $\phi_{vc}$ is the optimal
cutting score.

The formula for the phi coefficient $\phi_{vc}$ given above is suitable
for a 2 x 2 table of cell probabilities. More generally, the phi
coefficient is the Pearson product moment correlation between two
dichotomous variables, and could be arrived at as follows:

1. Each student with a test score above the cutting score in question
is assigned a 1, below a 0.

2. Each student in the instructed group is assigned a 1, in the
uninstructed group, a 0.

3. $\phi_{vc}$ is then the correlation coefficient computed in the usual
way.
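
The validity-coefficient procedure can be illustrated with the frequency
distribution of test scores for the uninstructed (N=70) and instructed
(N=80) groups given earlier. This is a sketch under two assumptions that
are not in the text: a score at or above the cutting score counts as
"predicted master", and phi is set to zero when the selection ratio is 0
or 1, where the formula is undefined. The function name is illustrative.

```python
from math import sqrt

# Test score -> (uninstructed frequency, instructed frequency).
frequencies = {
    8: (0, 7), 7: (2, 10), 6: (5, 18), 5: (8, 20), 4: (11, 15),
    3: (18, 5), 2: (13, 3), 1: (9, 2), 0: (4, 0),
}


def phi_coefficient(freq, cut):
    """Phi for one cutting score, from Prob(TM), BR and SR."""
    total = sum(u + i for u, i in freq.values())
    p_tm = sum(i for s, (u, i) in freq.items() if s >= cut) / total
    p_fm = sum(u for s, (u, i) in freq.items() if s >= cut) / total
    p_fn = sum(i for s, (u, i) in freq.items() if s < cut) / total
    br = p_fn + p_tm  # base rate
    sr = p_tm + p_fm  # selection ratio
    denominator = sqrt(br * (1 - br) * sr * (1 - sr))
    if denominator == 0:
        return 0.0  # assumption: treat the undefined case as no validity
    return (p_tm - br * sr) / denominator


phis = {cut: phi_coefficient(frequencies, cut) for cut in sorted(frequencies)}
best_cut = max(phis, key=phis.get)
print(best_cut, round(phis[best_cut], 3))
```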

v. Utility Analysis

In this section, costs or losses are assigned to the misclassification
of students as false masters or false non-masters. The procedures here are
closely tied to the decision-theoretic procedures discussed in a later
section. The procedure is presented at this point because it can be
related to the two Berk procedures just discussed.

First of all, Berk notes the following fact:

When the outcome probabilities or validity coefficient approach
is used to select the optimal cutting score, it is assumed
that the 2 types of errors are equally serious. If, however,
this assumption is not realistic in terms of the losses which
may result from a particular decision, the error probabilities
need to be weighted to reflect the magnitude of the losses
associated with the decision.

Berk notes that determination of the relative size of each loss is
judgmental, and must be guided by the consequences of the decision
considered. He mentions considering the following factors: student
motivation, teacher time, availability of instructional materials,
content, and others. Berk suggests the following, which we have capsulized
into a series of steps:

1. Estimate the expected disutility of a decision strategy ($\delta_k$) by

$$\delta_k = \text{Prob (FN)}(D_1) + \text{Prob (FM)}(D_2)$$

where $D_1$ and $D_2 < 0$,
k = the single decision in question, and
$D_1$ and $D_2$ = respective disutility values.

2. Estimate the expected utility of a decision strategy ($\upsilon_k$) by

$$\upsilon_k = \text{Prob (TM)}(U_1) + \text{Prob (TN)}(U_2)$$

where $U_1$ and $U_2 > 0$,
k = the single decision in question (same as for disutility), and
$U_1$ and $U_2$ = respective utility values.

3. Form a composite measure of test usefulness by combining the
estimates of utility and disutility across all decisions:

$$\gamma = \sum_{k=1}^{n} (\upsilon_k + \delta_k)$$

where $\gamma$ = index of expected maximal utility.

4. Choose the cutting score with the highest $\gamma$ index (it maximizes
the usefulness of the test for decisions with a specific set of
utilities and disutilities).
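
The utility analysis can be sketched for a single decision (n = 1) using
the frequency distribution given earlier for the uninstructed and
instructed groups. The utility and disutility values below are purely
hypothetical, the function and parameter names are illustrative, and a
score at or above the cutting score is assumed to mean "predicted
master". With equally weighted errors the index picks the same cutting
score as the outcome-probability approach; weighting false masters more
heavily moves the cutting score up.

```python
# Test score -> (uninstructed frequency, instructed frequency).
frequencies = {
    8: (0, 7), 7: (2, 10), 6: (5, 18), 5: (8, 20), 4: (11, 15),
    3: (18, 5), 2: (13, 3), 1: (9, 2), 0: (4, 0),
}


def utility_index(freq, cut, u1, u2, d1, d2):
    """Expected utility plus expected disutility for one decision.

    u1, u2 > 0 weight true masters and true non-masters;
    d1, d2 < 0 weight false non-masters and false masters.
    """
    total = sum(u + i for u, i in freq.values())
    p_tm = sum(i for s, (u, i) in freq.items() if s >= cut) / total
    p_fm = sum(u for s, (u, i) in freq.items() if s >= cut) / total
    p_tn = sum(u for s, (u, i) in freq.items() if s < cut) / total
    p_fn = sum(i for s, (u, i) in freq.items() if s < cut) / total
    disutility = p_fn * d1 + p_fm * d2
    utility = p_tm * u1 + p_tn * u2
    return utility + disutility


def best_cut(freq, **weights):
    """Cutting score with the highest index for the given weights."""
    return max(sorted(freq), key=lambda c: utility_index(freq, c, **weights))


equal_errors = best_cut(frequencies, u1=1, u2=1, d1=-1, d2=-1)
costly_false_masters = best_cut(frequencies, u1=1, u2=1, d1=-1, d2=-3)
print(equal_errors, costly_false_masters)
```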

vi. Suggestions 

The procedures developed by Berk (1976) hold considerable promise
for use in setting criterion-referenced test score standards. The ideas
in his procedures are not new; there are other procedures that are
concerned with the maximization of correct decisions and the minimization
of false-positive and false-negative errors. The attractive feature is the
ease with which Berk's methods can be understood and applied. The major
potential drawback is in the assignment of examinees to criterion groups.
If many examinees in the "instructed group" do not possess the assumed
knowledge and skills measured by the criterion-referenced test (or if
many examinees in the "uninstructed group" do), Berk's methods will
produce inaccurate results.

6.7.2 Decision-Theoretic Procedures

Berk (1976) looked at the minimization of false-positive and
false-negative decisions through the use of actual test data. He selects
as optimal the cutting score that minimizes false-positive and
false-negative errors. Another way to look at false-positive and
false-negative errors is to assume an underlying distributional form for
your data and then observe the consequences of setting values, such as
cut-off scores, based upon the distributional model. The logic is the same
here in terms of minimization of errors, except that by assuming a
distributional form, actual data do not have to be collected. Situations
can be simulated or developed, based upon the model.

Meskauskas (1976) has related and compared these procedures to those
based upon analyses of the content of the test. In reference to these
models, of which we will describe one:

. . .the models to follow deal with approaches that
start by assuming a standard of performance and then
evaluating the classification errors resulting from
its use. If the error rate is inappropriate, the
decision maker adjusts the standard a bit and tries
his equation again.

Before discussing one of the procedures in greater detail, the Kriewall
binomial-based model, the procedures discussed here should be related to
criterion-referenced testing procedures involving the determination of
test length. Many of the test length determination procedures (Millman,
1973; Novick & Lewis, 1974) make underlying distributional assumptions and
proceed in the fashion discussed above by Meskauskas. The focus of
concern, however, is test length determination, and not the setting of a
cutting score. In fact, Millman's (1973) procedure is based upon exactly
the same underlying distribution, the binomial, as is Kriewall's model to
be discussed. It should be pointed out that the procedures are exactly the
same; the data are just represented differently because of the level of
concern, either cutting score or test length.
i. Kriewall's Model

Kriewall's (1972) model focuses on categorization of learners into
several categories: non-master, master, and an in-between state where
the student has developed some skills, but not enough to be considered
a master.

Kriewall assumes the function of measurement, using the test, is
to classify students into one of two categories, master or non-master.
Of course, the test, as a sample of the domain of tasks, is going to
misclassify some individuals as false-positives (masters based on the
test, but non-masters in reality) and false-negatives (non-masters on the
test, but masters in reality). By assuming a particular distribution,
these errors may be studied.

Kriewall's probability model, used to develop the likelihood of
classification errors, is based upon the binomial distribution. He
assumes:

1. The test represents a randomly selected set of dichotomously scored
(0-1) items from the domain.

2. The likelihood of correct response for a given individual is a
fixed quantity for all items measuring a given objective.

3. Responses to questions by an individual are independent. That
is, the outcome of one trial (taking one question) is independent
of the outcome of any other trial.

4. Any distribution of difficulty of questions (for an individual)
within a test is assumed to be a function of randomly occurring
erroneous responses (Meskauskas, 1976).
With these assumptions, Kriewall views a student's test performance
as "a sequence of independent Bernoulli trials, each having the same
probability of success." A sequence of Bernoulli trials follows a binomial
distribution, which has a probability function that relates the
probability of occurrence of an event (a particular test score) to the
number of questions in the test by:

$$f(x) = \binom{n}{x} p^{x} q^{n-x}$$

where

x = test score
n = total number of test items
p = examinee domain score
q = 1 - p

and

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$


Kriewall sets some boundary values and then studies the probability of
misclassification errors. Using the notation of Meskauskas (1976), set:

$a_1$ = the lower bound of the mastery range (as a proportion of errors)
$a_2$ = the upper bound of the non-mastery range
C = the cutting score; the maximal number of allowable errors for
masters. Kriewall recommends $C = (a_1 + a_2)/2$.

Given values for the above three variables, Kriewall uses the (assumed)
. . . and $\beta$ is the probability of a false negative result (a master
who scores in the non-mastery category), then $\alpha$ and $\beta$ are
given by:

$$\alpha = \sum_{w=C}^{n} \binom{n}{w} a_1^{w} (1 - a_1)^{n-w}$$

$$\beta = \sum_{w=0}^{C-1} \binom{n}{w} a_2^{w} (1 - a_2)^{n-w}$$

where w = observed number of errors (and w = n - x) for an individual.
According to Meskauskas (1976) the formula for $\alpha$ is:

. . .equivalent to obtaining the probability that given
a large number of equivalent trials, a person whose true
. . . will fall in the non-mastery range.

By setting $a_1$ and $a_2$ at various values, and determining
$C = (a_1 + a_2)/2$, the probabilities of false positive and false
negative errors can be studied.

The optimal value for C (and thus $a_1$ and $a_2$) would then be the value
that minimized $\alpha$ and $\beta$. The results are dependent, however,
on n and w.
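
The binomial sums involved in Kriewall's model are easy to evaluate
numerically. The sketch below is illustrative only: the function names
and the values of n, the two error-rate bounds, and C are hypothetical,
and C is treated as a count of errors with "C or more errors" taken to
mean a non-mastery classification, which is an assumption about how the
cutting score is applied. The code reports the two tail probabilities
without deciding which should carry the false-positive or false-negative
label.

```python
from math import comb


def binomial_pmf(w, n, error_rate):
    """Probability of exactly w errors on n items for a fixed error rate."""
    return comb(n, w) * error_rate ** w * (1 - error_rate) ** (n - w)


def prob_at_least(c, n, error_rate):
    """Probability of c or more errors (sum over w = c, ..., n)."""
    return sum(binomial_pmf(w, n, error_rate) for w in range(c, n + 1))


def prob_fewer_than(c, n, error_rate):
    """Probability of fewer than c errors (sum over w = 0, ..., c - 1)."""
    return sum(binomial_pmf(w, n, error_rate) for w in range(c))


# Hypothetical setting: 10 items, error-rate bounds 0.2 and 0.5, C = 4 errors.
n_items, a1, a2, c = 10, 0.2, 0.5, 4
borderline_master_fails = prob_at_least(c, n_items, a1)
borderline_non_master_passes = prob_fewer_than(c, n_items, a2)
print(round(borderline_master_fails, 3), round(borderline_non_master_passes, 3))
```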

ii. Suggestions

While Kriewall has offered a method of studying classification
errors that does not depend upon actual data, we prefer the
method of Berk, due to its simplicity. Kriewall's model seems to
us to tie in much better with the procedures on test length
determination. For instance, suppose you have specified minimal
values for $\alpha$ and $\beta$, and have determined C, the cutting point.
Then the formulas above for $\alpha$ and $\beta$ can be solved for n, the
total number of questions needed. (It would be much easier if one isolated
n on the left hand side.) This is exactly what is done when using the
binomial model to solve the test length problem.

In sum, we prefer the Berk method for observing probabilities
of misclassification errors both because of its simplicity and because
of the lack of restricting underlying distributional assumptions.
Kriewall's method does, however, offer a viable alternative for
setting a cut-off score when actual test data cannot be collected.

6.7.3 Empirical Models Depending Upon a Criterion Measure

The models to be discussed in this section bear great resemblance
to both Berk's and Kriewall's methods just discussed. They have
been separated from those two methods because these methods are built
upon the existence of an outside criterion measure, performance
measure, or true ability distribution. The test itself, and the
possible cut-off scores, are observed in relationship with this outside
measure. The optimal cut-off is then chosen in reference to
the criterion measure. For instance, Livingston's (1975) utility-based
approach leads to the selection of a cut-off score that optimizes
a particular utility function. The procedure of Van der Linden and
Mellenbergh (1976), in contrast, leads to the selection of a cut-off
score that minimizes expected loss.

In reference to the setting of performance standards based upon
benefit (and cost), Millman (1973) has suggested that psychological
and financial costs be considered:

All things being equal, a low passing score
should be used when the psychological and
financial costs associated with a remedial
instructional program are relatively high.
That is, there should be fewer failings when
the costs of failing are high. These "costs"
might include lower motivation and boredom,
damage to self-concept, and dollar and time
expenses of conducting a remedial instructional
program. A higher passing score can
be tolerated when the above costs are not too
great or when the negative effects of moving
a student too rapidly through a curriculum
(i.e., confusion, inefficient learning and
so forth) are seen as very important to avoid.

In sum, to utilize these procedures, a suitable outside criterion
measure must exist. Success and failure (or probability of success
and failure) is then defined on the criterion variable and the cut-off
chosen as the score on the test that maximizes (or minimizes) some
function of the criterion variable. The existence of such a criterion
variable has implications for the utilization of these methods for
setting cut-off scores on minimum competency tests.

i. Livingston's Utility-based Approach

Livingston (1975) suggests the use of a set of linear or
semi-linear utility functions in viewing the effects of decision-making
accuracy based upon a particular performance standard or
cut-off score. That is, the benefit (and cost) of a decision is
related linearly to the cutting score in question.
Livingston's procedure is like Berk's procedure for utility
analysis discussed in 6.7.1 except that Livingston develops his
procedure based upon any suitable criterion measure (not
just instructed versus uninstructed), and also specifies the
relationship between utility (benefit or loss) and cutting scores as
linear. The relationship does not have to be linear; however, using
such a relationship simplifies matters somewhat. In such a situation,
the cost (of a bad decision) is proportional to the size of the
errors made and the benefit (of a good decision) is proportional to
the size of the errors avoided.
ii. Van der Linden and Mellenbergh's Approach

The developers of this procedure have prescribed a method for
setting cutting scores that is related both to Berk's procedure and
Livingston's. We will describe the procedure briefly and in the
process relate it to Berk's work. A test score is used to classify
examinees into two categories: accepted (scores above the cutting
score) and rejected (scores below). Also, a latent ability variable
is specified in advance and used to dichotomize the student population:
students above a particular point on the latent variable are
considered "suitable" and those below "not suitable." The situation may
be represented as follows.



                                Latent Variable
                        Not Suitable      Suitable
                           Y < d           Y >= d

Decision    Accepted     "False +"                        S1
             X >= c

            Rejected                      "False -"       S0
             X < c

where c = cutting score on the criterion-referenced test,
      d = cutting score on the latent variable (0 <= d <= 1), and
      l_ij (i, j = 0, 1) is a function of y and related to the loss function:

\[
L =
\begin{cases}
l_{00}(y) & \text{for } Y < d,\ X < c \\
l_{10}(y) & \text{for } Y \ge d,\ X < c \\
l_{01}(y) & \text{for } Y < d,\ X \ge c \\
l_{11}(y) & \text{for } Y \ge d,\ X \ge c
\end{cases}
\]

The authors then specify risk (the quantity to be minimized) as the
expected loss, and the cutting score that is optimal is the value of c that
minimizes the risk function (expected value of loss). They simplify mat-
ters (as does Livingston) by specifying their loss function as linear.

In sum, while Van der Linden and Mellenbergh have provided a method
for setting a cut-off score on the test, they have offered little to help
in setting the cut-off on the latent variable. In a sense then, they have
only transferred the problem of setting a standard to a different measure.
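
The risk-minimizing idea can be illustrated with a minimal sketch. It is not the authors' own formulation: we assume that a small set of paired observations (test score x, latent value y) stands in for the joint distribution, and that the loss is linear in the distance between y and d, with hypothetical slopes b_accept and b_reject. All names and data are illustrative only.

```python
def linear_loss(y, accepted, d, b_accept=1.0, b_reject=1.0):
    """Illustrative linear loss (an assumption, not the published form).

    Accepting an examinee costs b_accept * (d - y): positive when y < d
    (a "false +"), negative (a gain) when y >= d.
    Rejecting an examinee costs b_reject * (y - d): positive when y >= d
    (a "false -"), negative when y < d.
    """
    return b_accept * (d - y) if accepted else b_reject * (y - d)


def risk(c, pairs, d, **slopes):
    """Average loss over (x, y) pairs when examinees with x >= c are accepted."""
    return sum(linear_loss(y, x >= c, d, **slopes) for x, y in pairs) / len(pairs)


def optimal_cutting_score(pairs, d, candidates, **slopes):
    """Return the candidate cutting score c with the smallest risk."""
    return min(candidates, key=lambda c: risk(c, pairs, d, **slopes))


# Hypothetical data: (test score, latent ability) for seven examinees.
pairs = [(3, 0.30), (4, 0.45), (5, 0.50), (6, 0.65),
         (7, 0.70), (8, 0.85), (9, 0.90)]
best_c = optimal_cutting_score(pairs, d=0.60, candidates=range(0, 11))
```

Note that the sketch still requires d to be supplied from outside, which is exactly the limitation described above.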

iii. Livingston's Use of Stochastic Approximation Techniques

Livingston (1976) has developed procedures for setting cut-off
scores based upon stochastic approximation procedures. According to
Livingston, the problem involving cut-off scores can be phrased as
follows to fit stochastic procedures: "In general, the problem is
to determine what level of input (written test score) is necessary to
produce a given response (performance), when measurements of the
response are difficult or expensive." The procedure, according to
Livingston, is as follows:

1. Select a person; record his/her test score and measure 
his/her performance. 

2. If the person succeeds on the performance measure (if 
his/her performance is above the minimum acceptable), 
choose next a person with a somewhat lower test score.
If the person fails on the performance measure, choose
a person with a higher written test score.

3. Repeat step 2, choosing the third person on the basis of
the second person's measured performance.

Livingston offers two different procedures for choosing step size,
the up-and-down and the Robbins-Monro procedure, and a number of
procedures for estimating minimum passing scores consonant with each.
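
The steps above can be sketched for the fixed-step (up-and-down) case. This is a minimal illustration under stated assumptions, not Livingston's own algorithm: the function names are hypothetical, `succeeds` stands in for actually measuring one person's performance, and averaging the later test scores is only one simple way to estimate a passing score. (The Robbins-Monro procedure differs in that the step shrinks as observations accumulate.)

```python
def up_and_down(start_score, step, n_people, succeeds):
    """Return the test scores at which successive people are selected.

    After a success the next person is chosen `step` points lower;
    after a failure the next person is chosen `step` points higher.
    """
    scores = [start_score]
    while len(scores) < n_people:
        current = scores[-1]
        scores.append(current - step if succeeds(current) else current + step)
    return scores


def estimate_passing_score(scores, skip=0):
    """Average the selected test scores, ignoring the first `skip` of them."""
    kept = scores[skip:]
    return sum(kept) / len(kept)


# Hypothetical, deterministic stand-in for the performance measure:
# anyone scoring 62 or more on the written test succeeds.
sequence = up_and_down(50, 4, 10, lambda score: score >= 62)
estimate = estimate_passing_score(sequence, skip=2)
```

In practice the performance outcome is not a deterministic function of the test score, so the sequence wanders around the region where success and failure are about equally likely.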

This procedure, like those discussed earlier in this section,
depends upon the existence of a cut-score established on another
variable, this time the performance measure, in order to establish
the passing score on the test. This then limits greatly the applica-
bility of the method. Livingston (personal communication, 1978)
has suggested that judgmental data on performance can be used,
rather than actual performance data, with the procedure, but this
has yet to be documented in any fashion. When documented, the
possibilities for use of the procedures will be greatly expanded.

iv. Huynh's Procedures 

Huynh (1976) has advanced procedures for setting cut-off scores 
that are predicated on the existence of a "referral task." This 
referral task can be envisioned as an external criterion to which 
competency can be related. For instance, Huynh (1976) states that 
"Mastery in one unit of instruction may not be reasonably declared if 
it cannot be assumed that the masters would have better chances of 
success in the next unit of instruction." The next unit in this case
would be the referral task.

These procedures once again depend upon an outside criterion
variable to permit the estimation of a cut-score. In
this case, the cut-off score is related to the probability
of success of individuals on the referral task. Because
of the necessity of a criterion variable for operation, these pro-
cedures suffer in generalizability. They are, for instance,
apparently not useful for minimum competency testing situations where
a criterion variable, and associated probability of success, are
next to impossible to establish.

6.7.4 Educational Consequences

In this situation, one is concerned with looking at the effect setting
a standard of proficiency has on future learning or other related cognitive or
affective success criteria. According to Millman (1973), the question here is
"What passing score maximizes educational benefits?"

This approach can be visualized from an experimental design point of 
view. A subject matter domain is taught to a class of students who are then 
tested on the material. These students are assigned (randomly) to groups 
with the groups differing on the performance level required for passing the 
test. The students are then assessed on some valued outcome measure and the 
level of performance on the criterion-referenced test for which the valued 
outcome is maximal (it could be a combination of valued outcomes) becomes 
the performance standard or criterion score. 

Thus, to use this method, much more data need to be collected than
for the item content procedures. An experiment must be conducted, and
then a cut-off score is selected based upon the results of the experiment.
Because of the difficulties involved in designing and carrying out
experiments in school settings, the method is unlikely to find much
use.



i. Block's Study 

Block's study (1972) involves students learning a subject segment on matrix
algebra using a Mastery Learning paradigm. Such a paradigm dictates that
students who don't perform adequately on the posttest be recycled through
remedial activities until they demonstrate mastery (i.e., attain a score above
the cutting score). Block established four groups of students, where each
group was tested using one of the following four performance standards: 65,
75, 85, or 95% of the material in a unit must be mastered before proceeding
on to the next unit. He then examined the effects of varying the performance
standard on six criteria that were used as the variables to be maximized.
Viewing these criteria as either cognitive or affective, Block observed
that the 95% performance level maximized student performance on the cognitive
criteria, while the 85% performance level seemed to maximize the affective
criteria.



Some comments on Block's study are in order. One, the results lack general-
izability. The 95% and 85% levels, which maximize the cognitive and affective
measures respectively, are likely to change with the subject matter.
Two, as pointed out by Glass (1978a), the method of
maximizing a valued outcome assumes that there is a distinct point or
criterion score on the CRT that maximizes the outcome. What if the curve
relating performance on the CRT to the valued outcome is monotonically
increasing, so that 100% performance on the CRT maximizes the valued outcome? In fact,
it is more likely to be the case that the graph is monotonically increasing
than the case where the graph increases and decreases. For example:
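1. Monotonically increasing graph (problem situation)

[Figure: valued outcome (vertical axis) plotted against CRT performance (horizontal axis); the curve rises steadily up to 100%.]

2. Ideal situation

[Figure: valued outcome (vertical axis) plotted against CRT performance (horizontal axis, marked at 0%, 70%, and 100%); the curve rises and then falls.]

(Reproduced from Glass, 1978a, permission for reproduction pending.)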

Thus, it can be seen that unless the graph increases and then decreases,
a 100% performance standard will be optimal. This standard is of limited use
because it is not realistic to expect all students to attain that level.

Third, Block notes that if there are multiple criteria to be
maximized as valued outcomes, then some model for combining criteria with relevant
weights needs to be developed. He does not offer any procedures for
doing so, however, and he looks at the effects of the performance standards
on each of the six criteria separately. It should be noted that using multiple
criteria is a way around the problem discussed above (Glass, 1978a). For instance,
if one of the outcomes has a monotonically increasing relationship with the
test scores and the other a monotonically decreasing relationship, then the
composite can have a peak value at a point other than 0% or 100%. While
this would seem to solve the problem, another problem is only further
exacerbated: what weights should be assigned to the valued outcomes to
form the composite? These procedures have not yet been developed, and fur-
ther, they are likely to be situation specific.
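
The sensitivity of the result to the weights can be seen in a small sketch. The outcome values below are invented for illustration and are not Block's data; the function name and the simple weighted-sum composite are likewise our assumptions.

```python
def best_standard(standards, outcomes, weights):
    """Return the standard with the largest weighted composite outcome.

    `outcomes` is a list of series, one per valued outcome, each giving
    the observed outcome at every candidate standard.
    """
    def composite(i):
        return sum(w * series[i] for w, series in zip(weights, outcomes))

    best_index = max(range(len(standards)), key=composite)
    return standards[best_index]


standards = [65, 75, 85, 95]     # percent-correct standards
cognitive = [50, 58, 64, 68]     # hypothetical: rises with the standard
affective = [70, 68, 63, 52]     # hypothetical: falls with the standard

equal_weights = best_standard(standards, [cognitive, affective], [1, 1])
cognitive_only = best_standard(standards, [cognitive, affective], [1, 0])
affective_heavy = best_standard(standards, [cognitive, affective], [1, 2])
```

With these made-up figures the chosen standard moves from 95 to 85 to 75 as more weight is given to the affective outcome, which is why the choice of weights cannot be treated as a technical detail.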
6.8 Combination Methods

6.8.1 Judgmental-Empirical

Zieky and Livingston (1977), and more recently, Popham (1978), have
suggested two procedures that are based upon a combination of judgmental
and empirical data. In addition, both Zieky and Livingston and Popham have
included an in-depth discussion of how to implement the procedures,
something that has been lacking with many other procedures. The
two procedures presented by Zieky and Livingston, the Borderline-
Group and Contrasting-Groups methods, are procedurally similar.
They differ in the sample of students on which performance data are
collected. Further, while judgments are required, the judgments
necessary are on students, not on items as in many of the other
judgmental methods (Nedelsky, Angoff, Ebel, etc.). Zieky and
Livingston make the case that judging individuals is likely to be a
more familiar task than judging items. Teachers are the logical
choice as judges, and for them, the assessment of individuals is
commonplace.

i. Borderline-Group Method
This method requires that judges first
define what they would envision as minimally acceptable performance
on the content area being assessed. The judges are then asked to
submit a list of students (about 100 students) whose performances
are so close to the borderline between acceptable and unacceptable
that they can't be classified into either group. The test is
then administered to this group, and the median test score for the
group is taken as the standard.

ii. Contrasting-Groups Method
Once judges have defined minimally acceptable performance for
the subject area being assessed, the judges are asked to identify those
students they are sure are either definite masters or non-masters
of the skills measured by the test. Zieky and Livingston suggest
100 students in the smaller group in order to assure stable results.
The test score distributions for the two groups are then plotted and
the point of intersection is taken as the initial standard. This is
exactly the same as the graphical procedure suggested by Berk, and
presented in section 6.7.1. Zieky and Livingston then suggest ad-
justing the standard up or down to reduce "false masters" (students
identified as masters by the test, but who have not adequately mastered
the objectives) or "false non-masters" (students identified as non-
masters by the test, but who have adequately mastered the objectives).
The direction to move the cut-off score depends on the relative
seriousness of the two types of errors.
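
Both computations are simple enough to sketch. The Borderline-Group standard is a median. For the Contrasting-Groups method we use a numerical stand-in for the graphical procedure, which is our assumption rather than Zieky and Livingston's wording: the cut-off is chosen to minimize the weighted count of misclassified students, and the weights express the relative seriousness of the two errors. All names and scores are hypothetical.

```python
from statistics import median


def borderline_group_standard(borderline_scores):
    """Standard = median test score of the students judged borderline."""
    return median(borderline_scores)


def contrasting_groups_standard(master_scores, nonmaster_scores,
                                false_master_weight=1.0,
                                false_nonmaster_weight=1.0):
    """Cut-off (score >= cut-off passes) with the fewest weighted errors."""
    candidates = sorted(set(master_scores) | set(nonmaster_scores))

    def weighted_errors(cut):
        # Judged non-masters who would pass the test at this cut-off.
        false_masters = sum(s >= cut for s in nonmaster_scores)
        # Judged masters who would fail the test at this cut-off.
        false_nonmasters = sum(s < cut for s in master_scores)
        return (false_master_weight * false_masters
                + false_nonmaster_weight * false_nonmasters)

    return min(candidates, key=weighted_errors)


# Hypothetical test scores for students nominated by the judges.
borderline = [11, 12, 13, 13, 14, 15, 16]
masters = [14, 15, 16, 17, 18, 18, 19, 20]
nonmasters = [8, 9, 10, 11, 12, 13, 15, 16]

borderline_cut = borderline_group_standard(borderline)
initial_cut = contrasting_groups_standard(masters, nonmasters)
strict_cut = contrasting_groups_standard(masters, nonmasters,
                                         false_master_weight=3.0)
```

Treating a false master as three times as serious as a false non-master moves the cut-off upward, which mirrors the adjustment step described above.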

iii. Suggestions
These methods, particularly the Contrasting-Groups Method, are
very similar to the procedure suggested by Berk. Instead of actually
forming instructed and uninstructed groups, however, as suggested by
Berk, the Contrasting-Groups Method asks judges to form the groups.
This judgmental procedure would seem more advantageous when the content
being assessed has had a long instructional period (minimum competency
testing is an example), or when there would be problems justifying
the existence of an uninstructed group. Berk's method would be more
useful for tests based on short instructional segments, most likely
administered at the classroom level.

A comparison of the judgments involved in the two procedures
indicates that the Contrasting-Groups Method would be the easier
method to justify using. It is a more reasonable task for teachers
to identify "sure" masters and non-masters than it is for them to
identify borderline students in the subject area being assessed. In
sum, the Contrasting-Groups Method appears to us to be a most reason-
able way of setting standards.

6.8.2 Bayesian Procedures 

Novick and Lewis (1974) were the first to suggest that Bayesian 
procedures are useful for setting standards. Schoon, Gullion, and 
Ferrara (1978) have more recently discussed Bayesian procedures for 
setting standards. According to Schoon et al., Bayesian procedures
allow the incorporation of:

1. A loss ratio, reflecting the severity of false-positive 
and false-negative decision errors, 

2. prior information on the distribution of domain scores in 
the population of interest, 

3. current information on an examinee's domain score, and 

4. the degree of certainty that an examinee's domain score 
exceeds the cut-off score. 

Of course, a cut-off score must first be set in order for the four 
factors to be incorporated. Thus, Bayesian procedures offer a way 
of augmenting the establishment of a cut-off score rather than a 
method for setting the cut-off score itself.

In sum, Bayesian procedures present a method for augmenting
the setting of a cut-off score by utilizing available prior and
collateral information. The procedure also provides a posterior
statement of degree of certainty about a candidate's performance.
Bayesian procedures do not, however, offer a method for setting a
cut-score in the first place. Bayesian procedures have been included
in this review because they do offer a method for combining judgmental
and empirical data to arrive at a revised standard.
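
How the four factors fit together can be shown in a minimal sketch. It is not the procedure of Novick and Lewis or of Schoon et al.: we assume a Beta prior on the domain score, a binomial model for the number of items correct, and a simple rule that declares mastery when the posterior odds of exceeding the cut-off are greater than the loss ratio. The function names, the numerical integration, and the example figures are all illustrative.

```python
import math


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def prob_exceeds_cutoff(cutoff, correct, n_items,
                        prior_a=1.0, prior_b=1.0, steps=2000):
    """Posterior probability that the domain score exceeds the cut-off.

    Prior Beta(prior_a, prior_b) combined with `correct` of `n_items`
    right gives a Beta posterior, integrated here by the midpoint rule.
    """
    a = prior_a + correct
    b = prior_b + n_items - correct
    width = (1.0 - cutoff) / steps
    return width * sum(beta_pdf(cutoff + (k + 0.5) * width, a, b)
                       for k in range(steps))


def declare_master(cutoff, correct, n_items, loss_ratio=1.0, **prior):
    """Declare mastery when posterior odds exceed the loss ratio.

    loss_ratio = (loss from a false positive) / (loss from a false negative).
    """
    p = prob_exceeds_cutoff(cutoff, correct, n_items, **prior)
    return p > loss_ratio / (1.0 + loss_ratio)


# Hypothetical examinee: 8 of 10 items correct, cut-off of 0.70, flat prior.
certainty = prob_exceeds_cutoff(0.70, correct=8, n_items=10)
lenient = declare_master(0.70, 8, 10, loss_ratio=1.0)
cautious = declare_master(0.70, 8, 10, loss_ratio=3.0)
```

The cut-off of 0.70 is an input to every function here, which illustrates the point made above: the Bayesian machinery refines a decision around a standard but does not supply the standard.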
6.9 Some Procedural Steps in Standard Setting

In earlier sections of this unit, issues and many methods for standard- 
setting were discussed. In this section, procedures will be outlined 
for setting standards on criterion-referenced tests used for three dif- 
ferent purposes. The purposes considered are: 

1. Classroom testing 

2. Basic skills testing for yearly promotion and high school 
graduation 

3. Professional licensing and certification testing.

Classroom testing is emphasized since classroom teachers have fewer
technical resources available to them than do the larger testing programs. 
Our ultimate objective is to provide a comprehensive set of practical 
guidelines for practitioners. At this time the guidelines are far from 
comprehensive; much research is needed to supply information necessary
to construct thorough guidelines. We have suggested in places some of
the questions that need to be answered.

Certain things are assumed: first, that in each case a set of
objectives or competencies has been agreed upon, and that they are
described via the use of domain specifications or some other equally 
appropriate method. Second, it is assumed that no fixed selection ratio 
exists (e.g., one might be fixed in effect by having resources to provide 
only a certain number of students with remedial work) since if it does 
there is no reason to set standards. Finally, we do not discuss the 
important and interesting political issues of who participates in and 
who controls the standard-setting process; we take as given that some 
such process exists and only address the issue of participation from the 
perspective of practicality.

6.9.1 Preliminary Considerations

Before any standard setting is undertaken for any purpose, an
analysis of the decision-making context and of the resources available
for the project should be done. The results of this analysis will
determine how extensive and sophisticated the standard-setting procedure
should be. Analysis of the decision-making context involves judging
the importance of the decisions that are to be made using the test,
the probable consequences of those decisions, and the costs of errors.
Others have discussed using these same considerations in adjusting the
final standard, but they may also be helpful in choosing a standard-
setting method. Formal procedures for using this information are
probably not necessary; a discussion of the issues by those directing
the project should suffice. Some issues to consider would include (1) 
the number of people directly and indirectly affected by the decisions 
to be based on the test; (2) possible educational, psychological, 
financial, social and other consequences of the decisions; and (3) 
the duration of the consequences. 

The next step should be a consideration of the resources available 
for the standard setting. Resources include money, materials, clock time, 
personnel time and expertise. How much of the total amount of available 
resources will be dedicated to the standard setting will depend upon the 
results of the prior discussion of decision context. The final decision
as to the resources to be invested will determine how large and tech-
nically sophisticated the standard-setting enterprise may be.

A great deal of information needs to be collected on the actual ex-
penditures of various resources that have been required to carry out
standard setting by different methods in different contexts. Actual time
and money data would be invaluable to practitioners in choosing a
method for their own situation. In the following discussion procedural
steps in increasing order of expense and complexity will be offered, but
real data on these factors are lacking and are a pressing need.

6.9.2 Classroom Testing 

The classroom teacher is most likely to use criterion-referenced 
tests for diagnostic purposes, that is for determining whether a student 
has mastered an area or needs further work in it. This would seem to 
be the most common situation calling for the setting of standards. Here 
the teacher must decide what level of test performance constitutes 
"mastery." In the same testing context the teacher may set additional 
performance standards, above and/or below the minimal level, for the 
awarding of grades on the material. 

Typically the classroom teacher works alone, or at most with one 
or more other teachers of the same grade. It is also quite often the 
case that a classroom exam is used only once. In these situations methods 
based only on judgment of test content may be the only ones practicable.
The methods developed by Ebel, Nedelsky and Angoff would be appropriate
here, and the details of each of them have been discussed in an earlier
section, so we will not reiterate procedural steps here.

When available resources permit involving more people in the standard
setting, parents and other community members might be enlisted, or a group
of teachers of one grade from an entire school district might collaborate
in setting standards. Again, if resources permit, data on group performance
on individual items may be tabulated and considered in setting the standards
on subsequent tests, or if tests are retained from year to year, the
performance data from the previous year might be used. Of course, this
can also be done by teachers working alone. The following is a list of
steps, some of which could be omitted if resources were limited, for
involving parents of students in a particular class in setting standards
for classroom tests over units of instruction. The method borrows
heavily from Jaeger (1978). (It is assumed that the objectives have
been identified and the teacher (or teachers) has prepared domain speci-
fications.)

1. At the beginning of the school year, a letter is sent 
to parents explaining the project and inviting them to 
a meeting where more information will be given. 

2. At the meeting parents are given copies of domain speci- 
fications for the first test, along with example items. 
They are asked to indicate for each objective a percentage
of items which, answered correctly, would demonstrate the
student had mastered the material adequately. At this
meeting they should be encouraged to discuss the task
and ask any questions they might have about it.

Instructions accompanying the standard-setting task should indicate to 
the parents how their judgment will be employed (for example, averaged 
with the percentages indicated by every other parent, and the resulting 
standard applied to every child in that class or grade). We have sug-
gested for reasons of test security that the parents base their judgments
on domain specifications rather than on actual test items; if test
forms from previous years are available and thought to be parallel to 
the new exam, it may be easier for parents to make their judgments as 
a percentage correct of items on the parallel test. 

3. The teacher constructs the criterion-referenced test 
from the domain specifications before looking at the 
parents' standards.

4. Class performance data is tabulated after the test is
administered.

5. Parent judgment for the second test (or set of tests)
is solicited by mail. The mailing packet includes:
domain instructions (duplicating those given at the
earlier meeting), and performance data from the first
test (number of students achieving each set standard).

Instructions would also stress that judgments were to be based primarily
on domain specifications and only secondarily on performance data.

6. Step 5 is repeated during the year whenever a competency-
type test is to be given.

Alternatively, this procedure might be reserved for those instructional
units judged to cover basic, required objectives for that grade; parents'
instructions would then identify the tested materials as such.

7. The teacher keeps files for each test, including the 
domain specifications, parent judgment forms, actual 
exam and performance data. 

8. Periodic meetings can be held to review the instructions 
and to discuss the procedure and its results. 

Such discussions may lead to parents questioning the performance of students,
and are likely to provoke inquiry into both the teacher's methods and his/her
subject matter. Teachers should be prepared for this; it may lead to
parents wanting greater involvement in determining other aspects of their
children's schooling, a desire one hopes can be creatively and construc-
tively used.

Other variants on this procedure can include appointing a small
committee of parents, possibly working with several teachers, instead
of an open parents group. A parent-objective (matrix) sampling strategy 
could be employed to reduce the number of judgments required of each 
parent.

Another procedure for setting standards with criterion-referenced
tests in instructional settings was offered by Hambleton (1978). Ac-
cording to Hambleton, "[His] is not a 'validated list' of guidelines.
It is a list of practical guidelines I have evolved over the years
through my work with numerous school districts." His eleven-step list
of guidelines is as follows:

1. The determination of cut-off scores should be done by several 
groups working together. These groups include teachers, parents, 
curriculum specialists, school administrators, and (if the tests 
are at the high school level) students. The number from each 
group will depend upon the importance of the tests under con- 
sideration and the number of domain specifications. At a 
minimum, I like to have enough individuals to form at least
two teams of reviewers. This way I can compare their results
on at least a few domain specifications to determine the 
consistency of judgments in the two groups. When sufficient 
time is available I prefer to obtain two independent judgments 
of each cut-off score. 

2. I usually introduce either the Ebel method or the Nedelsky 
method. Following training on one of the methods, I have the 
groups work through several practice examples. Differences 
between groups are discussed and problems are clarified. 

3. The domain specifications (or, more usually but less appropriately,
the objectives) are introduced and discussed with the judges.

4. I try to set up a schedule so that roughly equal amounts of 
time are allotted to a consideration of each domain specifica- 
tion. If some domain specifications are more complex or 
important I usually assign them more time. 

5. I make sure that the judges are aware of how the tests will be
used and with what groups of students.

6. If there exist any relationships among the domain specifications
(or objectives) the information is noted. For example, if a
particular objective is a prerequisite to several others it
may be desirable to set a higher cut-off score than might other-
wise be set.

7. Whenever possible I try to have two or more groups determine
the cut-off scores. Consistency of their ratings can be studied,
and when necessary, differences can be studied, and a consensus
decision reached.

8. If some past test performance data are available, it can be
used to make some modifications to the cut-off scores. On
some occasions, instead of modifying cut-off scores, decisions
can be made to spend more time in instruction to try and im-
prove test performance. If past group performance on an
objective is substantially better than the cut-off score, less
time may be allocated to teaching the particular objective.

9. As test data become available, percentage of "masters" and
"non-masters" on each objective should be studied. If per-
formance on some objectives appears to be "out of line,"
an explanation can be sought by a consideration of the test
items (perhaps the test items are invalid), the level of the 
cut-off score, variation in test performance across classes, 
a consideration of the amount of instructional time allotted 
to the objective and so on. 

10. Whenever possible I try to compare the mastery status of 
uninstructed and instructed groups of examinees. Instructed 
groups ought to include mainly "master" students. The unin- 
structed groups should include mainly the "non-masters." If 
many students are being misclassified, a more valid cut-off
score can sometimes be obtained by moving it (for example, 
see Berk, 1976). 

11. It is necessary to re-review cut-off scores occasionally. 
Curriculum priorities change and so do instructional methods. 
These shifts should be reflected in the cut-off scores that 
are used. 

There are many important questions needing to be researched. These 
techniques have apparently been used very little (there is certainly 
much more literature on how to set standards than on what happens when 
one does); we need to know the effects of involving different groups of 
people in the standard-setting (especially parents as opposed to others), 
of the number of people involved, the information and instructions pro-
vided and the frequency of standard setting. How do these factors affect
the levels set and the public acceptability of the chosen standard, and are
the procedures cost-effective?

6.9.3 Basic Skills Testing for Annual Promotion and High School Graduation

These are clearly areas where greater importance is attached to the 
consequences of testing and, hence, more resources will be allocated than 
for classroom testing. The discussion is limited here to testing of
"minimal" competencies, not intending that the procedures be applied
to the total curriculum. Further, we are not discussing the "life skill" 
or "survival" competencies; in setting standards for these skills it is 
necessary to consider performance on criterion measures of life success. 
We feel that this undertaking is beyond the capabilities of educational 
and measurement practice. It will be difficult enough to decide upon and 
assess "minimal" skills. For these skills, since no external criterion 
measures can be said to exist, the appropriate performance data to 
consider in standard setting are scores on the actual tests (or items) . 
We agree with those (e.g., Jaeger, 1978; Linn, 1978; Shepard, 1976) who 
hold that performance data should be considered along with test content 
to inform the setting of standards. While from an idealistic point of 
view it would be desirable to set standards with reference only to the 
content of a domain, in reality the degree of skill in test construction 
required for the pure-content approach is probably beyond human attain- 
ment. In order to avoid unpleasant shocks it would seem good practice 
to examine test performance data; the other benefit of so doing is that 
feedback is received on our content-based judgments and may thus refine
our skills. 

Jaeger (1978) has provided an excellent guide to implementing a 
procedure involving representative groups affected by standards set for 
high school graduation. The method was discussed earlier, but a brief
review at this point seems useful. In general terms, it is an iterative
procedure for soliciting item-by-item judgments from groups of judges.
Information fed back to the judges at each iteration includes (a) group 
performance on each test item in a pilot administration, (b) the per- 
centage of students who would have passed given several different stand- 
ards, and (c) a distribution of the standards suggested by the judges in 
the group. The median passing score for each type of judge is computed, 
and the lowest of the medians taken as the standard. 
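
The final aggregation rule just described (median within each type of judge, then the lowest median) can be written out directly. This sketch covers only that last step, not the iterative feedback; the group labels and recommended passing scores are hypothetical.

```python
from statistics import median


def lowest_median_standard(passing_scores_by_group):
    """Return (standard, medians): the lowest of the per-group medians.

    `passing_scores_by_group` maps each type of judge to the passing
    scores recommended by the individual judges of that type.
    """
    medians = {group: median(scores)
               for group, scores in passing_scores_by_group.items()}
    return min(medians.values()), medians


# Hypothetical final-round recommendations on a 40-item test.
recommendations = {
    "teachers": [30, 32, 34, 36],
    "parents": [28, 31, 35],
    "students": [33, 34, 40],
}
standard, group_medians = lowest_median_standard(recommendations)
```

Taking the lowest median means that no group of judges, as summarized by its median, recommended a standard lower than the one adopted.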

The principal attraction of plans such as Jaeger's and the one out- 
lined in Section 6.9.2, which is based on Jaeger's, is their political 
viability. By involving a broad cross-section of constituents in the
setting of the standard, one increases the acceptability of that standard.
However, no actual control or very significant influence over the educa-
tional process is transferred to the constituency; the objectives and the
test, after all, are presented to them as givens, and their contribution
in setting the standard is really quite limited. Moreover, the consensus 
method, while probably not harmful, may not produce results that make any 
pedagogical sense. Where obtaining popular support is not a critical 
problem, educators may prefer to rely upon the judgments of subject- 
matter and measurement "experts" to set standards. This may produce a 
more coherent, if less universally-accepted, result. Such a procedure 
could be implemented as follows (the steps would be executed for each 
subject matter area by content experts working with measurement experts): 

1. Categorize the educational objectives or competencies 
as being of the knowledge/information type or of the 
rule-learning type (this distinction corresponds to 
Meskauskas's (1976) continuum vs. state mastery models).

In the first case it makes sense to speak of a domain score, and to sample
randomly from the domain to estimate that score. In the second, since
learning is presumed to be all-or-none, sampling considerations are not
relevant, but construction of a few test items that accurately reflect
the ability is critically important. Objectives domains of the first
type reflect Ebel's (1978) notion of the purpose of competency certi- 
fication tests as being efficient and accurate indicators of the level 
of achievement in a broad domain, rather than lists of specific compe- 
tencies attained. 

2. For objectives or competencies of the first type, construct 
tests with the aid of domain specifications, items matched 
to the domain specifications, and a suitable item sampling 
plan. 

3. Ebel's standard-setting method (or one of the other content- 
focused methods) may then be used to set the standard for 
these parts of the test. To use Ebel's method the items 
from all of the knowledge/information (or continuum) domains 
would be considered together. (Table 6.9.3 provides a com- 
parison of six possible methods.) 

4. Pooling the judgments of all the experts may present a 
problem. Simply averaging the ratings given to each item 
(on relevance and difficulty) and/or the standards assigned 
to each category, will probably not give a very meaningful 
result. Ideally, the experts will go through a series of 
iterations in which they compare their independent judgments 
(first of the item categorization and next of the standards 
they assigned to each category), note discrepancies, discuss 
the rationale for each judgment, possibly decide upon re- 
visions in the test (this will direct the procedure back to 
Step 2, to ensure that any revisions do not distort the 
test's domain representativeness), and/or persuade each 
other to change their judgments. Unanimity might be re- 
quired in order to proceed from this step. 

5. For those objectives or competencies classified as being of 
the "State" variety, smaller sets of items are required 
since the domains are more homogeneous, but item construc- 
tion must be, if anything, more painstaking. Ideally, 
experimental evidence would be garnered to show that item 
performance truly reflected the target construct. 

6. Standards on these State-type objectives can be adjusted 
back from 100% using Emrick's (1971) technique if the 
probabilities of Type 1 and Type 2 classification errors 
can be estimated. Similarly, domain scores can be adjusted 
by a Bayesian procedure (e.g., Hambleton & Novick, 1973) 
to compensate for relative losses associated with the classi- 
fication errors. 
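
The idea in Step 6 of moving the standard back from 100% can be illustrated with a small calculation. The sketch below is not Emrick's closed-form formula; it is a simplified search under a state model that we assume for illustration, in which a non-master answers any item correctly with probability `alpha`, a master answers any item incorrectly with probability `beta`, and items are independent. The prior proportion of masters (`p_master`) and the two loss weights are likewise hypothetical inputs.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k correct answers on n items, each correct with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def error_rates(n_items, cut, alpha, beta):
    """Error rates for the rule 'classify as master if score >= cut'.

    alpha: assumed chance that a non-master answers an item correctly.
    beta:  assumed chance that a master answers an item incorrectly.
    """
    false_positive = sum(binom_pmf(k, n_items, alpha)
                         for k in range(cut, n_items + 1))
    false_negative = sum(binom_pmf(k, n_items, 1 - beta)
                         for k in range(cut))
    return false_positive, false_negative

def best_cut(n_items, alpha, beta, p_master=0.5, loss_fp=1.0, loss_fn=1.0):
    """Cut score between 1 and n_items with the smallest expected loss."""
    def expected_loss(cut):
        fp, fn = error_rates(n_items, cut, alpha, beta)
        return (1 - p_master) * loss_fp * fp + p_master * loss_fn * fn
    return min(range(1, n_items + 1), key=expected_loss)

# Hypothetical five-item test for one State-type objective.
cut = best_cut(n_items=5, alpha=0.2, beta=0.1)
fp, fn = error_rates(5, cut, 0.2, 0.1)
print(f"cut score: {cut} of 5; false positive: {fp:.5f}; false negative: {fn:.5f}")
```

With these assumed values the search settles on 3 of 5 items rather than 5 of 5, because requiring a perfect score would fail about 41% of true masters. Raising `loss_fp` relative to `loss_fn` pushes the cut score back up.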
Table 6.9.3

A Comparison of Several Standard Setting Methods

                                Judgmental                                                        Combination
                                ----------------------------------------------------------------  ------------------------
                                           Modified              Modified                         Contrasting  Borderline
Question                        Nedelsky   Nedelsky   Angoff     Angoff     Ebel       Jaeger     Groups       Groups
------------------------------  ---------  ---------  ---------  ---------  ---------  ---------  -----------  -----------
1. Is a definition of the       Yes        Yes        Yes        Yes        Yes        No         No           Yes
   minimally competent
   individual necessary?

2. What is the nature of the    Items      Items      Items      Items      Items      Items      Individuals  Individuals
   rating task: items or
   individuals?

3. Are examinee data needed?    No         No         No         No         No         No         Yes          Yes

4. Do judges have access to     Yes        Yes        Yes        Yes        Yes        Yes        Usually,     Usually
   the items?                                                                                     but don't
                                                                                                  need to

5. Are the judgments made       Both       Both       Both       Both       Both       Both       Individual   Individual
   in a group setting or
   individual setting?

When the tests are used for yearly promotions, students' performance in 
the next grade can be used as a criterion in order to estimate the 
probabilities of classification errors. 
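
A minimal sketch of such an estimate follows. It assumes, hypothetically, that next-grade outcomes are available for every examinee, including those scoring below the candidate cut score (for example, from a tryout year in which all students were promoted). The records and the function name are invented for illustration.

```python
def classification_errors(records, cut):
    """Estimate classification error proportions for 'pass if score >= cut'.

    records: list of (test_score, succeeded_next_grade) pairs.
    Returns (false_positive, false_negative) as proportions of all examinees.
    """
    n = len(records)
    false_pos = sum(1 for score, ok in records if score >= cut and not ok)
    false_neg = sum(1 for score, ok in records if score < cut and ok)
    return false_pos / n, false_neg / n

# Hypothetical data: (test score, succeeded in the next grade?)
records = [(12, False), (15, False), (18, True), (20, False),
           (22, True), (25, True), (27, True), (30, True)]

for cut in (18, 20, 22):
    fp, fn = classification_errors(records, cut)
    print(f"cut {cut}: false positives {fp:.3f}, false negatives {fn:.3f}")
```

Tabulating the two proportions over a range of cut scores shows how lowering the standard trades false negatives for false positives.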

Research is needed on ways of pooling the judgments of several 
individuals, and of incorporating performance data in primarily content- 
based judgments. 

6.9.4 Professional Licensing/Certification Testing 

Tests for licensing and certification differ from the others dis- 
cussed here in having an external criterion, job performance, which the 
tests should predict. In addition, these tests are subject to govern- 
mental regulations and court rulings on the adequacy with which they 
reflect requisite job skills (and nothing more). Recent court decisions 
affirm that content validation of a test against the domain of entry- 
level job skills is sufficient to demonstrate that the test itself is 
fair. However, any standard used must also bear a rational relationship 
to job performance. 

One method that will probably be acceptable in the courts is to base 
the standard on experts' judgments of the importance of each tested 
item to adequate job performance; that is, to use one of the content- 
oriented methods to determine a percent correct for passing. The pooled 
judgments of a large number of expert practitioners would be desirable. 
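
As one illustration of a content-oriented computation, the sketch below pools Angoff-style ratings. Angoff's method is our choice for the example and the ratings are hypothetical: each judge estimates, for every item, the probability that a minimally competent practitioner would answer it correctly; a judge's standard is the mean of those estimates, and the pooled standard is the mean across judges.

```python
from statistics import mean

def angoff_standard(ratings):
    """Pooled percent-correct standard from Angoff-style ratings.

    ratings[j][i] is judge j's estimated probability that a minimally
    competent candidate answers item i correctly.
    Returns (pooled percent correct, list of per-judge proportions).
    """
    per_judge = [sum(judge) / len(judge) for judge in ratings]
    return 100 * mean(per_judge), per_judge

# Hypothetical ratings: three judges, four items.
ratings = [
    [0.8, 0.6, 0.9, 0.7],
    [0.7, 0.5, 0.9, 0.7],
    [0.9, 0.7, 0.8, 0.8],
]

standard, per_judge = angoff_standard(ratings)
print(f"per-judge standards: {[round(p, 2) for p in per_judge]}")
print(f"pooled passing score: {standard:.1f}% correct")
```

Reporting the per-judge values alongside the pooled figure makes disagreement among practitioners visible before a single passing score is adopted.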

Data on test performance would not be particularly useful in this 
situation since there is usually not any pre-existing knowledge or 
belief about the distribution of job-preparedness in the population. 
Empirical data on criterion (job) performance would be useful were it 
not for the pervasive selectivity of professions; to use criterion 
performance properly in establishing optimal passing scores requires an 
unselected population of job-holders. For these reasons, content- 
oriented procedures for setting standards are probably the most viable 
procedures in licensing and certification. 
6.10 Summary 

In this unit, a number of viable methods for setting standards 
were introduced. If you wish to view the test by itself and not in 
relationship to other variables, either Angoff's method or Nedelsky's 
method appears to be useful. If empirical data is available, Berk's 
method or the Contrasting Groups method seems especially useful. We 
have also discussed other methods, of a more complex nature, that are 
suitable for setting criterion-referenced standards. Our preference 
for the methods mentioned above stems from the fact that they are 
simple to implement, and appear to produce defensible results when 
applied correctly. In the final section of the paper, some proposed 
sets of procedures for standard setting with respect to three important 
uses of criterion-referenced tests were outlined. However, considerably 
more research must be done before these procedures can be recommended 
for wide-scale use. 

We will conclude this unit with a brief discussion of a very im- 
portant problem. Suppose a set of test items has been selected. If 
so, it is then possible to set standards via either judgmental or 
empirical methods (or both). However, if a standard can be set via 
reference to well-defined domain specifications, and sample test items, 
tests which will optimally discriminate (i.e., reduce the number of 
misclassifications) in the region of a standard can be constructed. This 
is done by selecting test items which "discriminate" in the region of 
the standard. Test items are piloted on samples of examinees similar to 
those who will eventually be administered the tests to determine item 
difficulty levels and discrimination indices. Items with p values near 
the standard and with the highest discrimination indices are selected 
for the test. Whether judges can reliably set standards from only domain 
specifications and some sample test items is unknown. Also, it is not 
known if standards set by these two different methods will produce 
different results. This is one of those situations where similar 
results across two methods would be highly desirable. 
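
The item-selection rule in this closing discussion (keep piloted items whose p values lie near the standard, and among those prefer the highest discrimination indices) can be sketched as follows. The pilot statistics, the tolerance that defines "near," and the number of items retained are all hypothetical choices made for illustration; the standard is assumed to be expressed as a proportion correct.

```python
def select_items(items, standard, n_select, p_tolerance=0.10):
    """Pick items for a test that discriminates best near the standard.

    items: list of dicts with 'id', 'p' (pilot proportion correct) and
           'disc' (pilot discrimination index).
    Keeps items with |p - standard| <= p_tolerance, then returns the ids of
    the n_select items with the highest discrimination indices.
    """
    near = [item for item in items
            if abs(item["p"] - standard) <= p_tolerance]
    near.sort(key=lambda item: item["disc"], reverse=True)
    return [item["id"] for item in near[:n_select]]

# Hypothetical pilot results.
pilot = [
    {"id": "A", "p": 0.72, "disc": 0.45},
    {"id": "B", "p": 0.40, "disc": 0.60},
    {"id": "C", "p": 0.68, "disc": 0.30},
    {"id": "D", "p": 0.75, "disc": 0.50},
    {"id": "E", "p": 0.95, "disc": 0.20},
    {"id": "F", "p": 0.65, "disc": 0.35},
]

chosen = select_items(pilot, standard=0.70, n_select=3)
print("selected items:", chosen)
```

Item B is excluded despite having the highest discrimination index, because its difficulty lies far from the standard and it would contribute little to decisions made near the cut-off score.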